How to convince a high school student that differentials don't work like fractions in general?

It all started when I tried to convince a 10th grader that if $f$ is a function defined on $\mathbb{R}^n$ the differential is defined by:

$\large \displaystyle df = \frac{\partial{f}}{\partial{x_1}}dx_1 + \frac{\partial{f}}{\partial{x_2}}dx_2 + \cdots + \frac{\partial{f}}{\partial{x_n}}dx_n$

and if $x_i = g_i(t)$ then:

$\large\displaystyle \frac{df}{dt} = \frac{\partial{f}}{\partial{x_1}}\frac{dx_1}{dt} + \frac{\partial{f}}{\partial{x_2}}\frac{dx_2}{dt} + \cdots + \frac{\partial{f}}{\partial{x_n}}\frac{dx_n}{dt}$
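For concreteness, here is a quick numeric sanity check of this formula, with a hypothetical choice $f(x_1, x_2) = x_1^2 x_2$ and $x_1 = \cos t$, $x_2 = \sin t$ (any smooth example would do):

```python
import math

# Check: df/dt computed from the chain-rule formula agrees with a direct
# finite-difference derivative of t -> f(cos t, sin t).
def f(x1, x2):
    return x1**2 * x2

def df_dt_chain(t):
    # right-hand side: f_{x1} * dx1/dt + f_{x2} * dx2/dt
    x1, x2 = math.cos(t), math.sin(t)
    dx1_dt, dx2_dt = -math.sin(t), math.cos(t)
    return 2 * x1 * x2 * dx1_dt + x1**2 * dx2_dt

def df_dt_numeric(t, h=1e-6):
    # left-hand side: central difference of the composed function
    g = lambda s: f(math.cos(s), math.sin(s))
    return (g(t + h) - g(t - h)) / (2 * h)

print(df_dt_chain(0.7), df_dt_numeric(0.7))  # the two values agree closely
```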

As he's a 10th grader, he's supposed to think of $df$ as a small change in the value of $f$ caused by a small change in $(x_1,...,x_n)$.

I have defined $df$ for a differentiable function $f: \mathbb{R} \to \mathbb{R}$ in the following naive but intuitive way and he has happily accepted this definition:

$\large \displaystyle df = \lim_{\Delta{x} \to 0} \Delta{y}$, where $\large \Delta{y} = f'(x)\Delta{x} + \epsilon(\Delta{x})\Delta{x}$. Here $\large \epsilon(\Delta{x})$ is the error term that turns the limit $\large f'(x) = \displaystyle \lim_{\Delta{x} \to 0}\frac{\Delta{y}}{\Delta{x}}$ into an equality, and by definition $\large \displaystyle \lim_{\Delta{x} \to 0}\epsilon(\Delta{x}) = 0$.
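A minimal numeric illustration of this definition, assuming the concrete example $f(x) = x^3$ at $x = 2$: the quantity $\epsilon(\Delta x) = \Delta y/\Delta x - f'(x)$ should visibly shrink to $0$ as $\Delta x \to 0$.

```python
# eps(dx) = (f(x+dx) - f(x))/dx - f'(x); for f(x) = x**3 this works out
# to 3*x*dx + dx**2, which goes to 0 with dx.
f = lambda x: x**3
fprime = lambda x: 3 * x**2

x = 2.0
for dx in (0.1, 0.01, 0.001):
    eps = (f(x + dx) - f(x)) / dx - fprime(x)
    print(dx, eps)  # eps shrinks roughly in proportion to dx
```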


Using that definition, I convinced him why the differential of a multivariable function generalizes to higher dimensions in that way. But I failed to convince him why it's not a good idea to cancel the $\partial{x_i}$ in the denominator with $dx_i$ as if we were dealing with fractions. I'm also wary of proving the chain rule for him by dividing by $\Delta{t}$ and then letting $\Delta{t} \to 0$. I'm looking for an easy explanation, suitable for a high school student, of why differentials shouldn't be treated as fractions, contrary to what many high school students believe.


A standard example is the equation $PV = T$ (the ideal gas law, with the constants normalized to $1$). Note that

$$P = \frac{T}{V} \implies \frac{\partial P}{\partial V} = -\frac{T}{V^2}$$ $$V = \frac{1}{P}T \implies \frac{\partial V}{\partial T} = \frac{1}{P}$$ $$T = PV \implies \frac{\partial T}{\partial P} = V$$ so $$\frac{\partial P}{\partial V} \frac{\partial V}{\partial T} \frac{\partial T}{\partial P} = -\frac{T}{V^2}\frac{1}{P}V = -\frac{T}{PV} = -1.$$
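You can even check the arithmetic numerically at an arbitrary state (the values of $P$ and $V$ below are arbitrary):

```python
# Triple-product check for PV = T: if partials cancelled like fractions,
# the product below would be +1; in fact it is -1.
P, V = 2.0, 3.0
T = P * V

dP_dV = -T / V**2   # from P = T/V, holding T fixed
dV_dT = 1.0 / P     # from V = T/P, holding P fixed
dT_dP = V           # from T = P*V, holding V fixed

print(dP_dV * dV_dT * dT_dP)  # -1, not +1 as naive cancellation suggests
```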


Edit: There's also the chain rule. If $f$ is a function of two variables, say $f(u,v)$, where both $u$ and $v$ are themselves functions of two variables (say $u=u(x,y)$ and $v=v(x,y)$), then the chain rule is

$$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial u}\frac{\partial u}{\partial x} + \frac{\partial f}{\partial v}\frac{\partial v}{\partial x}.$$

If we could just cancel the $\partial u$'s and $\partial v$'s, we'd get the absurd $\frac{\partial f}{\partial x} = \frac{\partial f}{\partial x} + \frac{\partial f}{\partial x}$.
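A finite-difference check makes the point concrete, with the hypothetical choice $f(u,v) = uv$, $u = x + y$, $v = x - y$ (so $f(x,y) = x^2 - y^2$ and $\partial f/\partial x = 2x$):

```python
# The chain-rule sum gives the correct partial; "cancelling" the du's and
# dv's would predict twice that value.
h = 1e-6
x, y = 1.3, 0.4

F = lambda x, y: (x + y) * (x - y)

# direct partial in x, by central difference
df_dx = (F(x + h, y) - F(x - h, y)) / (2 * h)

# chain rule: f_u * u_x + f_v * v_x, with u_x = v_x = 1
u, v = x + y, x - y
chain = v * 1 + u * 1

print(df_dx, chain)  # both are about 2*x = 2.6
```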

Admittedly, this is not the most conceptual explanation, but I imagine it'll convince quite a few high school (and college) students.


I think you can use the following strategy:

Your differential

$df = f_1 dx_1 + \ldots + f_n dx_n$

shows how $f$ responds to small changes in the coordinates. However, these coordinates can change independently of one another, so it is important to keep track of how much each one is changing.

In the formula

$df/dt = f_1 dx_1/dt + \ldots + f_n dx_n/dt$,

note that you are asking how $f(x(t))$ changes with $t$. However, for $x = g(t)$, each coordinate is changing at its own rate, described by $dx_i/dt$. So what would one be cancelling anyway?

You can use different choices of $g$ to illustrate this point. Take a $g$ that changes only in the $x_i$ direction; then every term except $f_i\, dx_i/dt$ disappears.
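A sketch of that special case, assuming $f(x_1, x_2) = x_1 x_2^2$ and a path $g(t) = (t, 5)$ that moves only in the $x_1$ direction:

```python
# Since dx2/dt = 0 along this path, the chain-rule sum collapses to the
# single term f_1 * dx1/dt = f_1.
f = lambda x1, x2: x1 * x2**2
g = lambda t: (t, 5.0)   # dx1/dt = 1, dx2/dt = 0

h = 1e-6
t = 2.0
df_dt = (f(*g(t + h)) - f(*g(t - h))) / (2 * h)

f1 = 5.0**2              # partial of f in x1 along the path
print(df_dt, f1)         # both are about 25
```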

The point is, when you generalize to higher dimensions, you have to consider each independent variable separately in the differential. In fact, he might have asked from the beginning: why not $df = n\,\partial f$, whatever that would mean? Hope this helps point in the right direction.


Edit: Along the same lines, it might be illuminating to show him how directional derivatives work, because they illustrate the same point again.


You seem to have two problems:

  • Your student is being misled by traditional but awful notation.
  • You want a nice conceptual explanation for the chain rule, rather than a nasty technical one.

I think both problems can be solved by taking a more modern approach.


I like the route Evan and John M suggested, introducing differentials via directional derivatives. I usually define $df(v)$ as the rate at which $f$ changes when you move through the domain with velocity $v$. I find that students are generally happy to accept this intuitive definition without more details. It's pretty obvious that $df(\alpha v)$ should always be equal to $\alpha df(v)$. Moreover, if $f$ is "nice enough," then $df(v + w) = df(v) + df(w)$. Notice that, in the Fréchet approach, this additivity property isn't proven from other facts—it's part of the definition of "nice enough"!

If we have a basis $e_1, \ldots, e_n$ for the domain of $f$, it's often useful to compute $df(v)$ in coordinates as $$df(v) = df(v_1 e_1 + \ldots + v_n e_n)$$ $$= v_1 df(e_1) + \ldots + v_n df(e_n).$$ This is the first expression in your question, rewritten in more coherent notation—in particular, there's nothing that looks like a fraction.

Now, suppose we want to find the rate of change of $f$ as we move along a path $\gamma(t)$. If our "velocity through time"—the rate at which the clock is running—is $\epsilon$, then our velocity through space is $d\gamma(\epsilon)$, so the rate of change of $f$ is $$df(d\gamma(\epsilon)).$$ Expanding $\gamma$ in coordinates as $\gamma_1 e_1 + \ldots + \gamma_n e_n$, we get $$df(d\gamma(\epsilon)) = df(d\gamma_1(\epsilon) e_1 + \ldots + d\gamma_n(\epsilon) e_n)$$ $$= df(e_1) d\gamma_1(\epsilon) + \ldots + df(e_n) d\gamma_n(\epsilon),$$ your second equation—again without anything that looks like a fraction.
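A numeric sanity check of this linearity, with a hypothetical $f$ and base point $p$: computing $df(v)$ directly as a directional derivative matches the coordinate expansion $v_1\, df(e_1) + v_2\, df(e_2)$.

```python
# df(v) = rate of change of f at p when moving with velocity v,
# approximated by a central difference along the line p + s*v.
h = 1e-6
f = lambda x, y: x**2 + 3 * x * y
p = (1.0, 2.0)

def df(v):
    return (f(p[0] + h * v[0], p[1] + h * v[1])
            - f(p[0] - h * v[0], p[1] - h * v[1])) / (2 * h)

v = (0.5, -1.5)
coords = v[0] * df((1, 0)) + v[1] * df((0, 1))
print(df(v), coords)  # the two values agree closely
```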


If the basis $e_1, \ldots, e_n$ comes from a coordinate system $x_1, \ldots, x_n$, it's a nice exercise to figure out that $dx_\mu(v_1 e_1 + \ldots + v_n e_n) = v_\mu$, so $$df(v) = v_1 df(e_1) + \ldots + v_n df(e_n)$$ $$= df(e_1) dx_1(v) + \ldots + df(e_n) dx_n(v).$$ We can then write $$df = df(e_1) dx_1 + \ldots + df(e_n) dx_n,$$ with the understanding that the argument is supposed to distribute over all the terms on the right-hand side.
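A tiny sketch of this viewpoint, with $dx_\mu$ literally implemented as the "extract coordinate $\mu$" function (the function $f$ and the point are hypothetical placeholders):

```python
# df is built as a combination of the coordinate-extracting functions
# dx1, dx2, with the argument v distributing over the terms.
h = 1e-6
f = lambda x, y: x * y
p = (2.0, 3.0)

dx1 = lambda v: v[0]   # dx_1(v) = v_1
dx2 = lambda v: v[1]   # dx_2(v) = v_2

df_e1 = (f(p[0] + h, p[1]) - f(p[0] - h, p[1])) / (2 * h)  # df(e1), about 3
df_e2 = (f(p[0], p[1] + h) - f(p[0], p[1] - h)) / (2 * h)  # df(e2), about 2

df = lambda v: df_e1 * dx1(v) + df_e2 * dx2(v)
print(df((4.0, 5.0)))  # about 3*4 + 2*5 = 22
```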

If you want to turn your lesson into a Victorian-era costume drama, you can introduce the shorthand $\tfrac{\partial f}{\partial x_\mu} = df(e_\mu)$, yielding the familiar coordinate expression for $df$. I would stress that, like giant sideburns and tight-laced corsets, this fraction-like notation is not necessarily meaningful, or particularly good for you.