Proof of multivariable chain rule
Solution 1:
These days I've been looking for a rigurous proof of the multivariable chain rule and I've finally found one that I think is very easy to understand. I will leave it here (if nobody minds) for anybody searching for this that is not familiar with little-o notation, Jacobians and stuff like this. To understand this proof, all you need to know is the mean value theorem.
Let's say we have a function $f(x,y)$ and $x = x(t), y = y(t)$. Let's also take $z(t) = f(x(t), y(t))$ By definition, the derivative of z $z'(t)$ is
$$ z'(t) = \lim_{\Delta t \to 0}{\frac {f(x(t+\Delta t),y(t+\Delta t)) - f(x,y)}{\Delta t}}$$.
$$ Let \ \Delta x = x(t+\Delta t)-x(t),$$ $$\Delta y = y(t+\Delta t)-y(t)$$
Now I'll take the numerator of the fraction in the limit, and make a small change.
$$ f(x(t+\Delta t), y(t+\Delta t)) - f(x,y) = f(x+\Delta x, y+\Delta y) - f(x,y)$$ $$ = \left[f(x+\Delta x, y+\Delta y) - f(x+\Delta x, y)\right] + \left[f(x+\Delta x, y) - f(x, y)\right]$$
I have just added and substracted $f(x+\Delta x, y)$. For some reason, I will invert the terms.
$$ = \left[f(x+\Delta x, y) - f(x, y)\right] + \left[f(x+\Delta x, y+\Delta y) - f(x+\Delta x, y)\right]$$.
Now, let's define 2 functions and I will name them g and h. First,
$$ Let \ g(x) = f(x, y) \implies g'(x) = \frac {\partial f} {\partial x} $$.
Please note that y is constant here since g is a function of a single variable. Now, by the mean value theorem we have
$$ \exists c_1 \in (x, x+\Delta x) \ so \ that$$ $$\frac {g(x+\Delta x) - g(x)} {\Delta x} = g'(c_1) $$ $$ \Longleftrightarrow $$ $$ f(x+\Delta x, y) - f(x, y) = f_x(c_1, y)\Delta x$$
Similarly, using the function $$ h(y) = f(x + \Delta x, y) \implies h'(y) = \frac {\partial} {\partial y}f(x+\Delta x, y)$$
We will have by the same logic that
$$ f(x+\Delta x, y + \Delta y) - f(x+\Delta x, y) = f_y(x + \Delta x, c_2)\Delta y, c_2 \in (y, y+\Delta y) $$
Notice that $c_1$ and $c_2$ are bounded with respect to $\Delta x$ and $\Delta y$ So as $\Delta x \to 0, c_1 \to x$ and as $\Delta y \to 0, c_2 \to y$. By our definition of $\Delta x$ and $\Delta y$, as $\Delta t \to 0$, both $\Delta x$ and $\Delta y$ $\to 0$. So, as $\Delta t \to 0$, $c_1 \to x$ and $c_2 \to y$.
The last step of the proof is to sum this all up, divide by $\Delta t$ and take the limit as $\Delta t \to 0$
$$ f(x(t+\Delta t), y(t+\Delta t)) - f(x, y) = f_x(c_1, y)\Delta x + f_y(x+\Delta x, c_2)\Delta y $$ $$ \lim_{\Delta t \to 0} \frac {f(x(t+\Delta t), y(t+\Delta t))}{\Delta t} = \lim_{\Delta t \to 0} f_x(c_1, y)\frac {\Delta x}{\Delta t} + f_y(x+\Delta x, c_2)\frac {\Delta y}{\Delta t} = f_x(x, y)x'(t) + f_y(x, y)y'(t) \ QED $$
Edit: After a long time I've realised that this proof assumes that $f$ has partial derivatives defined on intervals around the point $(x, y)$ and they are continuous at the point. This is a sufficient condition for the function to be ($\mathbb{R}^2$-)differentiable at $(x, y)$, but it's not equivalent. Yet, the multivariable chain rule works for the function being just differentiable at that point. So for a general proof, one should first understand little-o notation as in the other answers.
Solution 2:
Multivariable chain rule descends from the theorem of composite function for function of several variables which states in general that if:
f and g are differentiable in $x_0$ and $y_0=f(x_0)$, that is:
$$f(x_0+h)=f(x_0)+J_f(x_0)\cdot h+o(|h|)$$
$$g(y_0+k)=g(y_0)+J_g(y_0)\cdot k+o(|k|)$$
The composite function $g \circ f$ is also differentiable in $x_0$ and:
$$g(f(x_0+h))=g(f(x_0))+J_g(y_0)\cdot J_f(x_0)\cdot h+o(|h|)$$
NOTE
For the proof it is convenient to write:
$o(|h|)=|h|\cdot \omega_f(h)$ with $\omega_f(h) \to 0$
$o(|k|)=|k|\cdot \omega_g(k)$ with $\omega_g(k) \to 0$.
In the special case of:
$$f: \mathbb{R} \to \mathbb{R^n} \quad t \to f(x_1(t),x_2(t),...,x_n(t))$$
$$g: \mathbb{R^n} \to \mathbb{R}$$
$$\phi=g\circ f: \mathbb{R} \to \mathbb{R}$$
we have
$$J_f(t)= \begin{bmatrix} \frac{dx_1}{d t} \\ .\\\frac{dx_n}{d t} \end{bmatrix} = \begin{bmatrix} x_1' \\ .\\ x_n' \end{bmatrix}$$
$$J_g(x)= \nabla g = \left( \frac{\partial g}{\partial x_1},...,\frac{\partial g}{\partial x_n} \right)$$
And finally:
$$J_g(x)\cdot J_f(t)=\frac{\partial g}{\partial x_1}\frac{dx_1}{dt}+...+\frac{\partial g}{\partial x_n}\frac{dx_n}{dt}$$
That is the chain rule for this particular case.
Take also a look here: Derivation of the multivariate chain rule
The general theorem allow to find similar rules for any case by the Jacobian matrices $J_f$ and $J_g$.