Proof of this fairly obscure differentiation trick?

Suppose we're trying to differentiate the function $f(x)=x^x$. Now the textbook method would be to notice that $f(x)=e^{x \log{x}}$ and use the chain rule to find $$f'(x)=(1+\log{x})\ e^{x \log{x}}=(1+\log{x})\ x^x.$$

But suppose that I didn't make this observation and instead tried to apply the following differentiation rules:

$$\frac{d}{dx}x^c=cx^{c-1} \qquad (1)$$
$$\frac{d}{dx}c^x = \log{c}\cdot c^x \qquad (2)$$

which are valid for any constant $c$. Obviously neither rule is applicable to the form $x^x$, because here neither the base nor the exponent is constant. But if I pretend that the exponent is constant and apply rule $(1)$, I get $f'(x)\stackrel{?}{=}x\cdot x^{x-1}=x^x.$ Likewise, if I pretend that the base is constant and apply rule $(2)$, I obtain $f'(x)\stackrel{?}{=}\log{x}\cdot x^x$.

It isn't hard to see that neither of these derivatives is correct. But here's where the magic happens: if we sum the two “derivatives” we end up with $$x^x+ \log{x}\cdot x^x=(1+\log{x})\ x^x$$ which is the correct expression for $f'(x)$.
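Here is a quick numerical sanity check of that identity, in plain Python with a central-difference approximation (the function names are mine, not part of the question):

```python
import math

def f(x):
    return x ** x  # f(x) = x^x, for x > 0

def summed_trick(x):
    # rule (1) pretending the exponent is constant, plus
    # rule (2) pretending the base is constant
    return x * x ** (x - 1) + math.log(x) * x ** x

def numerical_derivative(g, x, h=1e-6):
    # central-difference approximation of g'(x)
    return (g(x + h) - g(x - h)) / (2 * h)

for x in [0.5, 1.0, 2.0, 3.0]:
    assert abs(summed_trick(x) - numerical_derivative(f, x)) < 1e-4
```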

This same trick yields correct results in other contexts as well. In fact, in some cases it turns out to be a more efficient way of taking derivatives. For example, consider $$g(x)=x^2 = \color{blue} x\cdot \color{red} x.$$ If we pretend the blue $\color{blue} x$ is a constant we would get $g'(x)\stackrel{?}{=}\color{blue}x\cdot 1=x$. Now if we pretend the red $\color{red}x$ is constant we get $g'(x)\stackrel{?}{=}1\cdot \color{red} x=x$. Summing both expressions we end up with $2x$ which is of course a correct expression for the derivative.

These observations have led me to the following conjecture:

Let $f(x,y)$ be a differentiable function mapping $\mathbb{R}^2$ to $\mathbb{R}.$ Let $f'_1 (x,y)=\frac{\partial}{\partial x} f(x,y)$ and $f'_2 (x,y)=\frac{\partial}{\partial y} f(x,y)$. Then for any $t$ we have: $$\frac{d}{dt}f(t,t)=f'_1 (t,t) + f'_2 (t,t).$$

(I apologise for the somewhat awkward notation which I could not seem to get around without causing undue ambiguity.)
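The conjecture is easy to spot-check numerically. A minimal sketch in plain Python, with a deliberately asymmetric test function of my own choosing:

```python
import math

def f(x, y):
    # an asymmetric test function: f(x, y) = x^2 * sin(y)
    return x ** 2 * math.sin(y)

def f1(x, y):
    # partial derivative in the first variable
    return 2 * x * math.sin(y)

def f2(x, y):
    # partial derivative in the second variable
    return x ** 2 * math.cos(y)

def h(t):
    return f(t, t)

def numerical_derivative(g, t, eps=1e-6):
    # central-difference approximation of g'(t)
    return (g(t + eps) - g(t - eps)) / (2 * eps)

for t in [0.3, 1.0, 2.5]:
    assert abs(numerical_derivative(h, t) - (f1(t, t) + f2(t, t))) < 1e-4
```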

This formulation also seems to lend itself to the following generalisation:

Let $f:\mathbb{R}^N \to \mathbb{R}$ be a differentiable function of the variables $x_1,x_2,\ldots,x_N$. For $n=1,2,\ldots,N$ define $f'_n(x_1,x_2,\ldots,x_N)=\frac{\partial}{\partial x_n}f(x_1,x_2,\ldots,x_N)$. Let $t$ be any real number and define the $N$-tuple $T=(t,t,\ldots,t)$. Then one has: $$\frac{d}{dt} f(T)=\sum_n f'_n(T).$$
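The general statement can likewise be spot-checked; here is the case $N=3$ with a test function of my own choosing (plain Python, central differences):

```python
import math

def f(x1, x2, x3):
    # test function: f(x1, x2, x3) = x1 * x2^2 * exp(x3)
    return x1 * x2 ** 2 * math.exp(x3)

# hand-computed partial derivatives, one per variable
partials = [
    lambda x1, x2, x3: x2 ** 2 * math.exp(x3),       # d/dx1
    lambda x1, x2, x3: 2 * x1 * x2 * math.exp(x3),   # d/dx2
    lambda x1, x2, x3: x1 * x2 ** 2 * math.exp(x3),  # d/dx3
]

def h(t):
    return f(t, t, t)

def numerical_derivative(g, t, eps=1e-6):
    # central-difference approximation of g'(t)
    return (g(t + eps) - g(t - eps)) / (2 * eps)

for t in [0.5, 1.2]:
    assert abs(numerical_derivative(h, t) - sum(p(t, t, t) for p in partials)) < 1e-4
```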

Thus my question is:

  • Is this true?
  • How can it be proven? (Specifically in the case $N=2$ but also in the general case.)

Solution 1:

Your observation is true and follows from the multivariable chain rule. To see why, let $f \colon \mathbb{R}^2 \rightarrow \mathbb{R}$ be differentiable and let $\gamma \colon \mathbb{R} \rightarrow \mathbb{R}^2$ be a differentiable curve. Set $\gamma(t) = (\gamma_1(t),\gamma_2(t))$ and consider the composition $h(t) = f(\gamma(t))$ which is a differentiable function from $\mathbb{R}$ to $\mathbb{R}$. The chain rule implies that

$$ h'(t) = \frac{d}{dt} f(\gamma_1(t),\gamma_2(t)) = \frac{\partial f}{\partial x}(\gamma(t)) \cdot \gamma_1'(t) + \frac{\partial f}{\partial y}(\gamma(t)) \cdot \gamma_2'(t). $$

If we take $\gamma(t) = (t,t)$, we get your observation, and this generalizes to arbitrary $N$.
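Concretely, with $\gamma(t) = (t,t)$ both component derivatives equal $1$, so the formula collapses to

```latex
h'(t) = \frac{\partial f}{\partial x}(t,t) \cdot 1 + \frac{\partial f}{\partial y}(t,t) \cdot 1
      = f'_1(t,t) + f'_2(t,t).
```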

A direct proof is also possible using the definition of differentiability. Write $$f(x,y) = f(t_0,t_0) + \frac{\partial f}{\partial x}(t_0,t_0)(x - t_0) + \frac{\partial f}{\partial y}(t_0,t_0)(y - t_0) + r(x,y)$$

where

$$ \lim_{(x,y) \to (t_0,t_0)} \frac{r(x,y)}{\sqrt{(x - t_0)^2 + (y - t_0)^2}} = 0 $$

and then

$$ \frac{f(t,t) - f(t_0,t_0)}{t - t_0} = \frac{\partial f}{\partial x}(t_0,t_0) + \frac{\partial f}{\partial y}(t_0,t_0) + \frac{r(t,t)}{t - t_0} \xrightarrow[t \to t_0]{} \frac{\partial f}{\partial x}(t_0,t_0) + \frac{\partial f}{\partial y}(t_0,t_0). $$
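The remainder term really does vanish here: the distance from $(t,t)$ to $(t_0,t_0)$ is $\sqrt{2}\,|t - t_0|$, so

```latex
\left| \frac{r(t,t)}{t - t_0} \right|
= \sqrt{2}\, \frac{|r(t,t)|}{\sqrt{(t - t_0)^2 + (t - t_0)^2}}
\xrightarrow[t \to t_0]{} 0.
```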


BTW, I agree with calling your observation "a trick", but I wouldn't call it obscure. In fact, it is useful in various contexts. For example, in differential geometry it is used to prove that the Lie bracket of two vector fields measures how an infinitesimal parallelogram obtained from the flows fails to close, or how the curvature contributes to parallel transport along a closed loop. In both cases, one defines a function $f \colon (-\varepsilon, \varepsilon)^4 \rightarrow V$ which depends on four parameters (so $f = f(t_1,t_2,t_3,t_4)$) and one wants to compute the second derivative of $h(t) = f(t,t,t,t)$ at $t = 0$. Applying the chain rule twice, we have

$$ h''(0) = \sum_{i,j} \frac{\partial^2 f}{\partial t_i \partial t_j}(0,0,0,0) $$

and then one uses various symmetries to compute the partial derivatives. For more details, see here.

Solution 2:

Let's write $f(x, y)=x^y$. You want to find the derivative of the single-variable function $g(x)=f(x, x)$.

$$f(x+h, x+h)-f(x, x)=\big(f(x+h, x+h)-f(x+h,x)\big)+\big(f(x+h, x) - f(x, x)\big)$$

In other words, instead of moving diagonally from $(x,x)$ to $(x+h,x+h)$, we first move right to $(x+h,x)$ and then up to $(x+h,x+h)$; the two brackets record those two steps. When you divide this equation by $h$ and let $h\to 0$, you get

$$g'(x)=\lim_{h\to0}\frac{f(x+h, x+h)-f(x+h,x)}{h} + \partial_1f(x, x)$$

Where by $\partial_1$ I mean the partial derivative with respect to the first variable. So far we've just used the definition of the derivative.

But what about that limit? It sure looks a lot like $\partial_2 f(x, x)$, but the problem is that the first variable is changing as $h$ tends to $0$. But as $h$ tends to zero, the first variable is tending to $x$, so we can basically just replace it with $x$, and then we'll have the definition of $\partial_2 f(x, x)$... right?

There are probably a few ways to do this, but this is one of the things the mean value theorem exists for: justifying intuitions about this sort of thing. By the mean value theorem (applied in the second variable), that ratio is equal to $\partial_2 f(x+h, \xi_h)$ for some $\xi_h$ in between $x$ and $x+h$. As $h$ tends to zero, that tends to $\partial_2 f(x, x)$... because $\partial_2 f(x, y)$ is a continuous (multi-variable) function (which is something that needs to be proven separately).

This is typical reasoning in basic multi-variable calc: you go from $A$ to $B$ one coordinate at a time, apply single-variable calc along each axis, and then use something like the mean value theorem to prove that what feels like it should work actually does work.