I am looking at this proof: https://math.berkeley.edu/~nikhil/courses/121a/chain.pdf and have some confusion. I have seen several proofs using little-o notation and am quite confused.

  1. Can someone explain the approximation of a differentiable function $g(x+\Delta x)=g(x)+\Delta x\cdot g'(x) +o(\Delta x)$? Why is this an equality? I think that as $\Delta x \to 0$ this becomes an equality by the definition of the derivative, so why isn't there a limit?

  2. I really don't understand the composition step. For example, why can we write $\Delta y = g'(x)\Delta x + o(\Delta x)$? I have seen people do this step in other ways, such as in the post "Proof of multivariable chain rule" (using $h,k$). If someone could explain this step, that would be very helpful.

  3. I have trouble extending the linear approximation in (1) to multiple dimensions. If $f:\mathbb R^m \to \mathbb R^n$, how can I approximate $f$ using the Jacobian?

Any help is appreciated!


Solution 1:

  1. The equality is equivalent to the statement that $g(x + \Delta x) - g(x) - \Delta x \cdot g'(x)$ is $o(\Delta x)$. Viewing this expression as a function of $\Delta x$, we see that $$\lim_{\Delta x \to 0} \frac{g(x + \Delta x) - g(x) - \Delta x \cdot g'(x)}{\Delta x}= 0$$ by the definition of the derivative. By the definition of little-o notation given in the second paragraph of the notes, this proves the (equivalent) equality.

  2. Since $y = g(x)$, we have $\Delta y = g(x + \Delta x) - g(x)$. We can now use the equality you asked about in (1) to see that $\Delta y = g'(x) \Delta x + o(\Delta x)$. I’m not sure what other questions you have about this step, but hopefully this will get you started.

  3. The basic idea is to “vectorize” everything. We essentially define the Jacobian $Dg$ to satisfy the equation

$$g(\mathbf{x} + \Delta \mathbf{x}) = g(\mathbf{x}) + Dg \cdot \Delta \mathbf{x} + o(\| \Delta \mathbf{x}\|), $$ where now $\mathbf{x}$ and $\Delta \mathbf{x}$ are vectors in $\Bbb{R}^m$. When you write out the entries of $Dg$, which are the various partial derivatives, and carry out the matrix multiplication, in each coordinate you’ll get the 1-dimensional chain rules as worked out above and on page two of the notes.
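
As a rough numerical illustration of the vector equation above (not something taken from the notes), the sketch below picks a hypothetical map $g:\Bbb{R}^2\to\Bbb{R}^2$ with a hand-computed Jacobian and checks that the remainder divided by $\|\Delta\mathbf{x}\|$ shrinks as $\Delta\mathbf{x}$ shrinks. The function `g`, the base point, and the direction of the step are all assumptions chosen only for the example.

```python
import math

def g(x, y):
    # Hypothetical example map R^2 -> R^2 (an assumption, not from the notes).
    return (x * x * y, math.sin(x) + y)

def jacobian_g(x, y):
    # Matrix of partial derivatives of g, computed by hand.
    return ((2 * x * y, x * x),
            (math.cos(x), 1.0))

def remainder_ratio(x, y, dx, dy):
    # ||g(x + dx) - g(x) - Dg * dx|| / ||dx||, which should tend to 0.
    g0 = g(x, y)
    g1 = g(x + dx, y + dy)
    J = jacobian_g(x, y)
    lin = (J[0][0] * dx + J[0][1] * dy,
           J[1][0] * dx + J[1][1] * dy)
    rem = (g1[0] - g0[0] - lin[0], g1[1] - g0[1] - lin[1])
    return math.hypot(rem[0], rem[1]) / math.hypot(dx, dy)

steps = (1e-1, 1e-2, 1e-3)
# Step along the fixed direction (1, -0.5), scaled by h.
ratios = [remainder_ratio(1.0, 2.0, h, -0.5 * h) for h in steps]
for h, r in zip(steps, ratios):
    print(f"h = {h:g}: remainder / |dx| = {r:.3e}")
```

The printed ratios should decrease roughly in proportion to `h`, which is what $o(\|\Delta\mathbf{x}\|)$ means for a smooth map like this one. A numerical check of this kind only suggests the limit; it does not prove it.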

Solution 2:

(1) The little-oh at the end is what makes it an equality. The expression is not an approximation of the function (it becomes one only if you drop the little-oh term); it is the definition of the derivative, in its most natural form. You have likely seen it defined as $$g'(x)=\lim_{\Delta x\rightarrow0}\frac{g(x+\Delta x)-g(x)}{\Delta x}$$ But this form of the definition only makes sense for functions of a real variable (for vector inputs, division by $\Delta x$ is undefined). So a more suitable form is to write the derivative as the incremental ratio minus some function $\rho$ of $\Delta x$, requiring only that $\rho(\Delta x)\rightarrow0$ as $\Delta x \rightarrow0$. We then get the equality $$g'(x)=\frac{g(x+\Delta x)-g(x)}{\Delta x}-\rho(\Delta x)$$ That is to say, the derivative differs from the incremental ratio by $\rho(\Delta x)$, which can be made smaller in absolute value than any $\epsilon>0$ by taking $\Delta x$ small enough (which is to say that the limit of the incremental ratio is the derivative, the original statement). We can then multiply through by the increment $\Delta x$ and rearrange the terms to get $$g(x+\Delta x)=g(x)+g'(x)\Delta x+\rho(\Delta x)\Delta x$$ We call the last term, the one involving $\rho$, the little-oh and write it as $o(\Delta x)$, the form you are already familiar with. So hopefully you now see why it is an equality.

As for your third question, the groundwork above makes the extension to functions $f:\mathbb{R}^m\rightarrow \mathbb{R}^n$ quite natural. A function $f$ is called differentiable at a point $x$ if there exists a linear map $df(x)$ (which is then unique) such that $$f(x+\Delta x)=f(x)+df(x)\Delta x+\rho(\Delta x)\|\Delta x\|$$ where $\rho$ is now vector-valued and $\rho(\Delta x)\rightarrow0$ as $\Delta x\rightarrow0$. This linear map is called the differential of the function at $x$, and is essentially the true generalized definition of the derivative for such functions.

It can be shown that the matrix of this map is the Jacobian matrix $J_f(x)$, whose entries are the partial derivatives of $f$, so most likely you'll see the last expression written as $$f(x+\Delta x)=f(x)+J_f(x)\Delta x+\rho(\Delta x)\|\Delta x\|$$ where the argument is a vector, so $J_f(x)\Delta x$ is a matrix-vector product. This gives a very natural, intuitive "proof" of the chain rule, because as you know the composition of two linear maps corresponds to the product of their matrices.
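
To make the "product of matrices" remark concrete, here is a minimal sketch that compares the product of two hand-computed Jacobians with a finite-difference Jacobian of the composition. The maps `f` and `g`, the base point `x0`, and the helper names `matmul` and `numerical_jacobian` are all hypothetical choices made for this example; `g` plays the role of the inner function, as in the notes.

```python
import math

def g(x):
    # Hypothetical inner map R^2 -> R^2 (an assumption for illustration).
    x1, x2 = x
    return [x1 * x2, x1 + math.sin(x2)]

def J_g(x):
    # Jacobian of g, computed by hand.
    x1, x2 = x
    return [[x2, x1], [1.0, math.cos(x2)]]

def f(y):
    # Hypothetical outer map R^2 -> R^2 (an assumption for illustration).
    y1, y2 = y
    return [math.exp(y1) * y2, y1 - y2 * y2]

def J_f(y):
    # Jacobian of f, computed by hand.
    y1, y2 = y
    return [[math.exp(y1) * y2, math.exp(y1)], [1.0, -2.0 * y2]]

def matmul(A, B):
    # Plain matrix product of two nested-list matrices.
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def numerical_jacobian(func, x, h=1e-6):
    # Central finite differences, one input coordinate at a time.
    cols = []
    for j in range(len(x)):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        cols.append([(a - b) / (2 * h) for a, b in zip(func(xp), func(xm))])
    # cols[j][i] holds d(func_i)/d(x_j); transpose into row-major form.
    return [[cols[j][i] for j in range(len(x))] for i in range(len(cols[0]))]

x0 = [0.5, 1.2]
chain = matmul(J_f(g(x0)), J_g(x0))            # J_f(g(x)) * J_g(x)
numeric = numerical_jacobian(lambda x: f(g(x)), x0)
max_err = max(abs(chain[i][j] - numeric[i][j])
              for i in range(2) for j in range(2))
print("max entrywise difference:", max_err)
```

The two matrices should agree up to finite-difference error. Note that $J_f$ is evaluated at $g(x)$, not at $x$, which is the detail most easily missed when writing the chain rule in matrix form.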
I hope this has helped.