Chain rule using a tree diagram: why does it work?

Solution 1:

The point of derivatives in one variable is to provide linear approximations $f(x) = f(p) + f'(p) (x - p) + o(|x - p|)$ to nice functions. Multivariate derivatives work the same way, except "linear approximation" here means approximation by a general linear transformation (a matrix) instead of a scalar.

This is made precise by the following definition: we say that a function $f : \mathbb{R}^n \to \mathbb{R}^m$ has total derivative a linear transformation $df_p : \mathbb{R}^n \to \mathbb{R}^m$ at a point $p$ if there exist $\epsilon > 0$ and a function $E_p(h)$ defined for $|h| < \epsilon$ such that

$$f(p + h) = f(p) + df_p(h) + |h| E_p(h)$$

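where $\lim_{h \to 0} E_p(h) = 0$. The matrix of $df_p$ with respect to the standard bases, whose entries are the partial derivatives of the components of $f$, is called the Jacobian matrix. In little-o notation, the defining equation reads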

$$f(p + h) = f(p) + df_p(h) + o(|h|).$$
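To make the definition concrete, here is a minimal numerical sketch. The map `f`, the point `p`, and the direction of `h` are hypothetical choices made only for illustration; the code evaluates $|E_p(h)| = |f(p + h) - f(p) - df_p(h)| / |h|$ for shrinking $h$.

```python
import math

def f(x, y):
    # Hypothetical example map R^2 -> R^2, chosen only for illustration.
    return (x * y, x + y * y)

def df(p, h):
    # Total derivative of f at p applied to h:
    # the matrix [[y, x], [1, 2y]] times the vector h.
    x, y = p
    hx, hy = h
    return (y * hx + x * hy, hx + 2 * y * hy)

def error_ratio(p, h):
    # |f(p + h) - f(p) - df_p(h)| / |h|, i.e. |E_p(h)| in the definition.
    fp = f(*p)
    fph = f(p[0] + h[0], p[1] + h[1])
    lin = df(p, h)
    residual = math.hypot(fph[0] - fp[0] - lin[0], fph[1] - fp[1] - lin[1])
    return residual / math.hypot(h[0], h[1])

p = (1.0, 2.0)
# Shrink h along the (assumed) direction (1, -1).
ratios = [error_ratio(p, (s, -s)) for s in (1e-1, 1e-2, 1e-3)]
print(ratios)
```

For this particular `f` the ratio shrinks in proportion to the step size (about 0.1, 0.01, 0.001), which is the behaviour $E_p(h) \to 0$ asks for.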

This might seem unnecessarily complicated, but it is the key to understanding the multivariate chain rule. Suppose that in addition to $f$ we have another function $g : \mathbb{R}^m \to \mathbb{R}^k$ with a total derivative $dg_q$ at some point $q$, and suppose that $f(p) = q$. Then

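$$gf(p + h) = g \left( f(p) + df_p(h) + o(|h|) \right) = gf(p) + dg_q df_p(h) + o(|h|)$$

Here $gf$ denotes the composition $g \circ f$. The second equality applies the definition of $dg_q$ at $q = f(p)$ with increment $k = df_p(h) + o(|h|)$: because $df_p$ is linear, $|k|$ is bounded by a constant multiple of $|h|$ for small $h$, so the error term $o(|k|)$ is also $o(|h|)$, and $dg_q$ sends the $o(|h|)$ part of $k$ to something that is again $o(|h|)$.

In other words,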

The total derivative $d(gf)_p$ of $gf$ at $p$ is the (matrix) product of the total derivatives $dg_q$ and $df_p$.

This is the most general statement of the multivariate chain rule. The relationship to tree diagrams is that one can model matrix multiplication using composition of incidence matrices, which come from graphs depicting incidence relationships between sets.
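The matrix-product form of the chain rule can be checked numerically. The sketch below is only an illustration under stated assumptions: the maps `f` and `g` and the point `p` are hypothetical, and the Jacobians are approximated by central differences rather than computed exactly.

```python
import math

def jacobian(func, p, eps=1e-6):
    # Central-difference approximation of the matrix of partial derivatives
    # of func at p; entry [i][j] approximates d(func_i)/d(x_j).
    cols = []
    for j in range(len(p)):
        plus = list(p)
        minus = list(p)
        plus[j] += eps
        minus[j] -= eps
        fp, fm = func(plus), func(minus)
        cols.append([(a - b) / (2 * eps) for a, b in zip(fp, fm)])
    return [[cols[j][i] for j in range(len(p))] for i in range(len(cols[0]))]

def matmul(A, B):
    # Plain matrix product of two nested lists.
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def f(p):
    # Hypothetical map R^2 -> R^3.
    x, y = p
    return [x * y, x + y * y, math.sin(x)]

def g(q):
    # Hypothetical map R^3 -> R^2.
    u, v, w = q
    return [u * v + w, math.exp(w) * u]

def gf(p):
    # The composition g after f, a map R^2 -> R^2.
    return g(f(p))

p = [1.0, 2.0]
q = f(p)
lhs = jacobian(gf, p)                          # approximates d(gf)_p
rhs = matmul(jacobian(g, q), jacobian(f, p))   # approximates dg_q df_p
max_diff = max(abs(a - b) for ra, rb in zip(lhs, rhs) for a, b in zip(ra, rb))
print(max_diff)
```

Each entry of the product is a sum over the intermediate variables, which is exactly the "sum over branches" that a tree diagram encodes.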

In your particular example, you have a function $t \mapsto (x, y) : \mathbb{R}^1 \to \mathbb{R}^2$ and another function $(x, y) \mapsto z : \mathbb{R}^2 \to \mathbb{R}^1$. The total derivative of the first function is $\left[ \begin{array}{c} \frac{dx}{dt} \\ \frac{dy}{dt} \end{array} \right]$ and the total derivative of the second function is $\left[ \frac{\partial z}{\partial x}, \frac{\partial z}{\partial y} \right]$, so the total derivative of their composition is the product

$$\frac{dz}{dt} = \left[ \frac{\partial z}{\partial x}, \frac{\partial z}{\partial y} \right] \left[ \begin{array}{c} \frac{dx}{dt} \\ \frac{dy}{dt} \end{array} \right] = \frac{\partial z}{\partial x} \frac{dx}{dt} + \frac{\partial z}{\partial y} \frac{dy}{dt}$$

and this is precisely the formula you give. The connection to diagrams is that one can represent a composition of linear transformations $\mathbb{R}^1 \to \mathbb{R}^2$ and $\mathbb{R}^2 \to \mathbb{R}^1$ using a pair of incidence matrices, one to represent incidences between a $1$-element set and a $2$-element set, and the other to represent incidences between that $2$-element set and another $1$-element set.
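As a rough check of the two-branch formula, the sketch below uses the hypothetical functions $x = \cos t$, $y = \sin t$, $z = x^2 y$ (none of them come from the question). It compares the sum over the two branches of the tree with a direct finite-difference derivative of $z$ along the path.

```python
import math

def x(t):
    # Hypothetical path component x(t).
    return math.cos(t)

def y(t):
    # Hypothetical path component y(t).
    return math.sin(t)

def z(x_val, y_val):
    # Hypothetical function of two variables.
    return x_val ** 2 * y_val

def dz_dt_tree(t):
    # Sum over the two branches z -> x -> t and z -> y -> t;
    # each branch contributes the product of the derivatives on its edges.
    dz_dx = 2 * x(t) * y(t)
    dz_dy = x(t) ** 2
    dx_dt = -math.sin(t)
    dy_dt = math.cos(t)
    return dz_dx * dx_dt + dz_dy * dy_dt

def dz_dt_numeric(t, eps=1e-6):
    # Central-difference derivative of t -> z(x(t), y(t)).
    return (z(x(t + eps), y(t + eps)) - z(x(t - eps), y(t - eps))) / (2 * eps)

t0 = 0.7
print(dz_dt_tree(t0), dz_dt_numeric(t0))
```

The two printed values should agree to many decimal places, up to finite-difference error.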

Solution 2:

Think of the differentiation as taking derivatives along the different axes.

So what you have is, in essence, $\frac{dz}{dt} = \left. \frac{\partial z}{\partial x} \right|_y \frac{dx}{dt} + \left. \frac{\partial z}{\partial y} \right|_x \frac{dy}{dt}$.

The sum appears because the path generally does not run along either axis alone: it moves in the $x$ and $y$ directions at the same time. Each term gives the contribution to the change in $z$ from the motion along one axis, and their sum gives the rate of change of $z$ along the path.

Solution 3:

This video will certainly clarify things: http://www.youtube.com/watch?v=2bF6H_xu0ao.

Although it may take a bit longer, I personally find that computing the total differential is substantially easier and more intuitive than a tree diagram.
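As a small illustration of the total-differential approach, take the hypothetical example $z = x^2 y$ with $x = \cos t$ and $y = \sin t$ (these functions are not from the question; they are chosen only to make the computation concrete). The differential of $z$ is

$$dz = \frac{\partial z}{\partial x} \, dx + \frac{\partial z}{\partial y} \, dy = 2xy \, dx + x^2 \, dy,$$

and substituting $dx = -\sin t \, dt$ and $dy = \cos t \, dt$ gives

$$dz = \left( -2xy \sin t + x^2 \cos t \right) dt, \qquad \text{so} \qquad \frac{dz}{dt} = -2 \cos t \sin^2 t + \cos^3 t.$$

Each of the two terms corresponds to one branch of the tree diagram.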