How to prove the chain rule?

Assuming everything behaves nicely ($g$ is differentiable at $a$, $f$ is differentiable at $g(a)$, and $g(x) \neq g(a)$ when $x$ is close to, but different from, $a$), the derivative of $f(g(x))$ at the point $x = a$ is given by $$ \lim_{x \to a}\frac{f(g(x)) - f(g(a))}{x-a} = \lim_{x\to a}\frac{f(g(x)) - f(g(a))}{g(x) - g(a)}\cdot \frac{g(x) - g(a)}{x-a}, $$ where the first factor tends to $f'(g(a))$ (since $g(x) \to g(a)$ by continuity of $g$ at $a$) and the second tends to $g'(a)$, so the limit is $f'(g(a))\cdot g'(a)$, by the definition of the derivative.
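As a quick numerical sanity check (not part of the proof), the splitting above can be tested on a concrete pair, say $f = \sin$ and $g(x) = x^2$ at $a = 1$, where the limit should be $f'(g(1))\cdot g'(1) = 2\cos 1$. A minimal sketch, assuming NumPy is available:

```python
import numpy as np

f, fp = np.sin, np.cos                   # f and its derivative f'
g, gp = lambda x: x**2, lambda x: 2*x    # g and its derivative g'
a = 1.0

for h in [1e-1, 1e-3, 1e-5]:
    x = a + h
    quotient = (f(g(x)) - f(g(a))) / (x - a)   # difference quotient of f(g(x))
    factored = (f(g(x)) - f(g(a))) / (g(x) - g(a)) * (g(x) - g(a)) / (x - a)
    print(h, quotient, factored, fp(g(a)) * gp(a))   # all three agree as h shrinks
```

Note that here $g(x) - g(a) = 2h + h^2 \neq 0$ for the values of $h$ used, so the division in the factored form is legitimate; the caveat about $g(x) \neq g(a)$ is exactly what the later answers address.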


One approach is to use the fact that "differentiability" is equivalent to "approximate linearity", in the sense that if $f$ is defined in some neighborhood of $a$, then $$ f'(a) = \lim_{h \to 0} \frac{f(a + h) - f(a)}{h}\quad\text{exists} $$ if and only if $$ f(a + h) = f(a) + f'(a) h + o(h)\quad\text{at $a$ (i.e., "for small $h$").} \tag{1} $$ (As usual, "$o(h)$" denotes a function satisfying $o(h)/h \to 0$ as $h \to 0$.)
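As a concrete illustration of (1) (my example, not part of the original argument): for $f = \exp$ at $a = 0$, the remainder $f(a+h) - f(a) - f'(a)h$ should be $o(h)$, i.e., the remainder divided by $h$ should tend to $0$. A minimal numerical sketch, assuming NumPy:

```python
import numpy as np

f, fp = np.exp, np.exp   # exp is its own derivative, so f'(0) = 1
a = 0.0

for h in [1e-1, 1e-2, 1e-4, 1e-6]:
    remainder = f(a + h) - f(a) - fp(a) * h   # the o(h) term in (1)
    print(h, remainder / h)                   # tends to 0 as h -> 0
```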

If $f$ is differentiable at $a$ and $g$ is differentiable at $b = f(a)$, and if we write $b + k = y = f(x) = f(a + h)$, then $$ k = y - b = f(a + h) - f(a) = f'(a) h + o(h), $$ so $k = O(h)$ and hence $o(k) = o(h)$, i.e., any quantity negligible compared to $k$ is negligible compared to $h$. Now we simply compose the linear approximations of $g$ and $f$: \begin{align*} f(a + h) &= f(a) + f'(a) h + o(h), \\ g(b + k) &= g(b) + g'(b) k + o(k), \\ (g \circ f)(a + h) &= (g \circ f)(a) + g'\bigl(f(a)\bigr)\bigl[f'(a) h + o(h)\bigr] + o(k) \\ &= (g \circ f)(a) + \bigl[g'\bigl(f(a)\bigr) f'(a)\bigr] h + o(h). \end{align*} Since the right-hand side has the form of a linear approximation, (1) implies that $(g \circ f)'(a)$ exists and equals the coefficient of $h$, i.e., $$ (g \circ f)'(a) = g'\bigl(f(a)\bigr) f'(a). $$ One nice feature of this argument is that it generalizes, with almost no modification, to vector-valued functions of several variables.
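To illustrate that last remark: in the multivariable version, $f'(a)$ and $g'(f(a))$ are Jacobian matrices and the chain rule says the Jacobian of $g \circ f$ is their product. Below is a rough numerical check of that statement; the maps are chosen arbitrarily, the Jacobians are approximated by forward differences, and the sketch assumes NumPy — it is an illustration, not part of the proof:

```python
import numpy as np

# Arbitrary smooth maps f, g : R^2 -> R^2, chosen only for illustration
f = lambda v: np.array([v[0] * v[1], np.sin(v[0])])
g = lambda w: np.array([w[0] + w[1] ** 2, np.exp(w[0])])

def jacobian(func, p, eps=1e-6):
    """Forward-difference approximation of the Jacobian of func at p."""
    p = np.asarray(p, dtype=float)
    cols = []
    for i in range(p.size):
        dp = np.zeros_like(p)
        dp[i] = eps
        cols.append((func(p + dp) - func(p)) / eps)
    return np.column_stack(cols)

a = np.array([0.5, -1.0])
lhs = jacobian(lambda v: g(f(v)), a)        # Jacobian of g ∘ f at a
rhs = jacobian(g, f(a)) @ jacobian(f, a)    # g'(f(a)) · f'(a) as a matrix product
print(np.allclose(lhs, rhs, atol=1e-4))     # True, up to finite-difference error
```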


As suggested by @Marty Cohen in [1], I went to [2] to find a proof. Under fair use, I include Hardy's proof here (more or less verbatim).

Let $\phi(x) = F\{f(x)\}$ denote the composite function. We write $f(x) = y$ and $f(x+h) = y+k$, so that $k\rightarrow 0$ as $h\rightarrow 0$ and $$ \frac{k}{h} \rightarrow f'(x). \tag{$*$} $$ We must now distinguish two cases.

I. Suppose that $f'(x) \neq 0$, and that $h$ is small but not zero. Then, by $(*)$, $k\neq 0$ for all sufficiently small $h$, and \begin{align*} \dfrac{\phi(x+h) - \phi(x)}{h} &= \dfrac{F(y+k) - F(y)}{k}\cdot\dfrac{k}{h} \rightarrow F'(y)\,f'(x). \end{align*}

II. Suppose that $f'(x) = 0$, and that $h$ is small but not zero. There are now two possibilities:

II.A. If $k=0$, then \begin{align*} \dfrac{\phi(x+h) - \phi(x)}{h} &= \frac{F\{f(x+h)\}-F\{f(x)\}}{h} = \frac{F\{y\}-F\{y\}}{h} = \frac{0}{h} = 0 = F'(y)\,f'(x). \end{align*}

II.B. If $k\neq 0$, then \begin{align*} \dfrac{\phi(x+h) - \phi(x)}{h} &= \frac{F\{f(x+h)\}-F\{f(x)\}}{k}\cdot\dfrac{k}{h}. \end{align*} The first factor is nearly $F'(y)$ (in particular, it is bounded), and the second factor is small because $k/h\rightarrow f'(x) = 0$. Hence $\dfrac{\phi(x+h) - \phi(x)}{h}$ is small in either case, and \begin{align*} \dfrac{\phi(x+h) - \phi(x)}{h} &\rightarrow 0 = F'(y)\,f'(x). \end{align*}
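The reason the case $f'(x) = 0$ needs separate treatment is that $k$ may then vanish for values of $h$ arbitrarily close to $0$, so dividing by $k$ (as in case I, or in the first answer above) is not always legitimate. A standard example, added here for illustration (it is not in Hardy's text), is $f(h) = h^2\sin(1/h)$ with $f(0) = 0$: it is differentiable at $0$ with $f'(0) = 0$, yet $k = f(h) - f(0) = 0$ whenever $h = 1/(n\pi)$. A small numerical sketch, assuming NumPy:

```python
import numpy as np

# f(h) = h^2 * sin(1/h), with f(0) = 0: differentiable at 0 and f'(0) = 0,
# but f(h) = 0 at every h = 1/(n*pi), i.e. k = 0 for h arbitrarily close to 0.
f = lambda h: h ** 2 * np.sin(1.0 / h)

for n in [10, 100, 1000]:
    h = 1.0 / (n * np.pi)
    print(h, f(h))   # k = f(h) - f(0) is 0 up to rounding, so one cannot divide by k
```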

Bibliography

[1] Chain rule proof doubt

[2] G. H. Hardy, "A Course of Pure Mathematics," 10th edition, Cambridge University Press, 1960, p. 217.