How can I "see" that calculus works for multidimensional problems?
Solution 1:
For the most general case, think about a mixing board.
Each input argument to the function is represented by a slider with an associated piece of a real number line along one side, just like in the picture. If you are thinking of a function which can accept arbitrary real number inputs, the slider would have to be infinitely long, which is of course impossible in real life, but perfectly fine in the imaginary, ideal world of mathematics. The mixing board also has a dial, which displays the number corresponding to the function's output.
The partial derivative of the function with respect to one of its input arguments corresponds to how sensitive the readout on the dial is when you wiggle the slider representing that argument just a little bit around wherever it is currently set: that is, how much more or less dramatic the change in the readout is compared to the size of your wiggle. If you wiggle a slider by, say, 0.0001, and the value changes by 0.0002, the partial derivative with respect to that variable at the given setting is (approximately) 2. If the value changes in the opposite sense, i.e. goes down when you move the slider up, the derivative is negative.
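This "wiggle" description translates directly into a finite-difference estimate. A minimal Python sketch, with a toy dial function of my own choosing:

```python
def f(a, b):
    return 2 * a + a * b  # hypothetical dial readout

def partial(f, args, i, h=1e-4):
    """Wiggle argument i by h and compare the change shown on the dial."""
    wiggled = list(args)
    wiggled[i] += h
    return (f(*wiggled) - f(*args)) / h

# At (a, b) = (1.0, 0.0): a wiggle of 0.0001 in a moves the dial by about
# 0.0002, so the partial derivative with respect to a is approximately 2.
print(partial(f, (1.0, 0.0), 0))
```

The smaller the wiggle `h`, the closer the estimate gets to the ideal partial derivative (until floating-point noise takes over).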
The gradient, then, is the ordered list of signed proportions by which you have to "wiggle" all the sliders so as to achieve the strongest possible, but still small, positive wiggle in the value on the dial. This is a vector, because you can think of vectors as ordered lists of quantities that can be subjected to elementwise addition and to multiplication by a single number.
And when I say "small" here I mean "ideally small", i.e. "just on the cusp of being zero", which you can make formally rigorous in a number of ways, such as by using limits.
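To see numerically that the gradient really does give the strongest small positive wiggle, here is a quick check (the function and point are my own toy choices): it compares the dial's change over equal-length small steps in many directions against the direction of the numerically estimated gradient.

```python
import math

# Toy dial function of my own choosing.
def f(x, y):
    return x * x + 3 * y

def grad(x, y, h=1e-6):
    """Numerical gradient: wiggle each slider separately."""
    return ((f(x + h, y) - f(x, y)) / h, (f(x, y + h) - f(x, y)) / h)

x0, y0 = 1.0, 2.0
gx, gy = grad(x0, y0)
grad_dir = math.atan2(gy, gx)      # direction the gradient points in

eps = 1e-3
best = None
for k in range(360):               # try a small step in each of 360 directions
    a = 2 * math.pi * k / 360
    gain = f(x0 + eps * math.cos(a), y0 + eps * math.sin(a)) - f(x0, y0)
    if best is None or gain > best[0]:
        best = (gain, a)

print(best[1], grad_dir)  # the best direction is (nearly) the gradient direction
```

The winning direction agrees with the gradient's direction up to the coarseness of the 360-direction sweep.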
Solution 2:
Well, in one variable you need to solve $f'(x)=0$ and check whether the solution is a maximum or a minimum. This equation is called the Euler equation, and it also holds in more variables, with the formulation $$\nabla f(x)=0,$$ where $\nabla f(x)=\left(\frac{\partial f(x)}{\partial x_{1}},\dots,\frac{\partial f(x)}{\partial x_{n}}\right)$. Here too you should check whether the point is a minimum, and this is done by examining the Hessian of $f$. That is, if $\bar{x}$ is such that $\nabla f(\bar{x})=0$, then $\bar{x}$ is a (local) minimum if $Hf(\bar{x})$ admits only positive eigenvalues, where $Hf$ is the matrix made of second-order partial derivatives of $f$. Notice that we are assuming $\bar{x}$ belongs to the interior of the domain, exactly as for functions of one variable.
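A concrete sketch of this recipe, on a toy function of my own: $f(x,y)=x^2+xy+y^2-3y$ has $\nabla f = (2x+y,\; x+2y-3)$, which vanishes at $\bar{x}=(-1,2)$, and its Hessian is the constant matrix $\begin{pmatrix}2&1\\1&2\end{pmatrix}$, whose eigenvalues we can compute in closed form.

```python
import math

# f(x, y) = x^2 + x*y + y^2 - 3*y (my toy example):
# grad f = (2x + y, x + 2y - 3) vanishes at xbar = (-1, 2),
# and the Hessian is constant: [[2, 1], [1, 2]].
a, b, c = 2.0, 1.0, 2.0   # symmetric Hessian [[a, b], [b, c]]

# Eigenvalues of a symmetric 2x2 matrix, in closed form.
mean = (a + c) / 2
spread = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
eig = (mean - spread, mean + spread)

print(eig)            # (1.0, 3.0)
print(min(eig) > 0)   # True: both eigenvalues positive, so (-1, 2) is a minimum
```

For larger Hessians one would hand the matrix to a numerical eigenvalue routine instead of using the 2x2 formula.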
Solution 3:
Consider a single-variable function $f(x)$ and suppose it has a maximum at $x_o$. We can detect that maximum because the function's derivative has a sign flip at that point: if we take a point $x>x_o$,
$$ \frac{df}{dx}|_{x>x_o} = \text{something negative}$$
Now, the magnitude of the derivative reflects (roughly) how far you are from the maximum, so suppose you take a 'step' on the $x$ axis scaled by the derivative:
$$ \frac{df}{dx}\Big|_{x > x_o} \Delta x$$
Then you will end up walking toward the maximum. Now let's say you are at a point $x<x_o$; then the first derivative is positive and you will still end up walking toward the maximum. Moral of the story? If you walk around the input set keeping your steps scaled by the function's derivative, you'll eventually hit a global maximum/minimum. [Edit: it may also turn out that you get stuck at a local extremum instead of the global one :(]
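The walking recipe above is gradient ascent. A minimal sketch on a toy function of my own, $f(x) = -(x-2)^2$, whose (global) maximum is at $x=2$:

```python
# Gradient ascent on f(x) = -(x - 2)^2 (my toy example), maximum at x = 2.
def dfdx(x):
    return -2 * (x - 2)

x = 5.0                     # start to the right of the maximum: derivative < 0
for _ in range(200):
    x += 0.1 * dfdx(x)      # step scaled by the derivative

print(x)  # converges to 2.0
```

Starting at $x<2$ instead gives a positive derivative and the walk converges to the same point, exactly as argued above.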
Now consider a multivariable function $f(x,y)$. By the logic above, if it has a local max, say at a point $(x_o,y_o)$, and you take an $x>x_o$, then
$$ \frac{\partial f}{\partial x}\Big|_{x>x_o} = \text{something negative}$$
A similar argument to the single-variable case can be applied, and the same goes for $y$. Ultimately this leads us to the idea that the vector
$$ \nabla f = \left< \frac{\partial f}{\partial x} , \frac{\partial f}{\partial y} \right>$$
tells us how to move in the input plane so that the function increases.
So, say you are at a point $(x_o,y_o)$; then the point where you should move next to increase the function is:
$$ <x,y> = <x_o,y_o> + \left< \frac{\partial f}{\partial x}\Big|_{(x_o,y_o)} \Delta x ,\ \frac{\partial f}{\partial y}\Big|_{(x_o,y_o)} \Delta y \right>$$
Why? If $z=f(x,y)$, then for a small step $d\mathbf{s} = \left<\Delta x, \Delta y\right>$ in the input plane,
$$\Delta z= \frac{\partial f}{\partial x} \Delta x + \frac{\partial f}{\partial y} \Delta y= \nabla f \cdot d\mathbf{s}.$$
Clearly, for a fixed step length $|d\mathbf{s}|$, the greatest increase in the function happens when the angle between the step and the gradient is zero.
Hence, using that gradient vector as a compass, you'll eventually reach some kind of extremum point in the input plane.
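A sketch of the two-variable compass on a toy function of my own, $f(x,y) = -(x-1)^2-(y+2)^2$, whose maximum sits at $(1,-2)$:

```python
# Using the gradient as a compass on f(x, y) = -(x - 1)^2 - (y + 2)^2
# (my toy example), whose maximum sits at (1, -2).
def grad(x, y):
    return (-2 * (x - 1), -2 * (y + 2))

x, y = 4.0, 3.0
for _ in range(300):
    gx, gy = grad(x, y)
    x, y = x + 0.1 * gx, y + 0.1 * gy   # small step along the gradient

print(x, y)  # converges to (1.0, -2.0)
```

Stepping along $-\nabla f$ instead would walk toward a minimum, which this toy function does not have.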
Solution 4:
$\def\p{\partial} \def\vr{{\bf r}} \def\th{\theta} \newcommand\pder[2][]{\frac{\partial #1}{\partial #2}} \newcommand\der[2][]{\frac{d #1}{d #2}}$Nontrivial geometrical intuition can be obtained by considering temperature as a function of position in three dimensions. Suppose temperature is given by $T(\vr)$, where $\vr = \langle x,y,z\rangle$. Surfaces of constant temperature ("equipotentials") are given by $T(\vr) = \mathrm{const}$. We wish to minimize the temperature starting from $\vr=\vr_0$ where $T(\vr_0)=T_0$, that is, we wish to move from this equipotential to another with lower temperature following the shortest possible path. We will argue that we must move in a direction orthogonal to the equipotential. In fact, we must move in the direction opposite that of the gradient of temperature.
Consider a path through $\vr_0$ parametrized by $t$. By multivariable chain rule, the rate of change of $T$ along the path is given by \begin{align*} \frac{d T(\vr(t))}{dt} = \pder[T]{x}\der[x]{t} +\pder[T]{y}\der[y]{t} +\pder[T]{z}\der[z]{t} = \nabla T\cdot\vr'(t),\tag{1} \end{align*} where $\nabla T = \langle \p T/\p x,\p T/\p y,\p T/\p z\rangle$ is the gradient of temperature. Assume that $|\vr'(t)|=1$, independent of path (that is, the paths are parametrized by arclength). The derivative $dT/dt$ is then a measure that can be used to compare the rate of change of $T$ for different paths through $\vr_0$. Note that $$\left|\der[T(\vr(t))]{t}\right| = |\nabla T|\,|\cos\th|,$$ where $\th$ is the angle between $\nabla T$ and $\vr'(t)$. The magnitude of the rate of change is clearly maximized for $\th=0,\pi$. The value of $T$ does not change for paths with $\th=\pi/2$. These paths are tangential to the equipotential at $\vr_0$. Thus, $dT/dt$ is maximally negative if $\vr'$ is antiparallel to $\nabla T$ (and thus perpendicular to the equipotential). Since $\vr'$ is tangential to the path, we expect $T$ to decrease maximally in the direction given by $-\nabla T$. Thus, we let $\Delta\vr = -\varepsilon\nabla T$, where $\varepsilon>0$ is sufficiently small.
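Equation (1) can be verified numerically. A quick check with $T = x^2+y^2+z^2$ along a unit-speed straight-line path of my own choosing:

```python
import math

# Check of the chain rule (1) for T = x^2 + y^2 + z^2 along the unit-speed
# straight path r(t) = (t, 2t, 0) / sqrt(5) (path chosen by me). Here
# T(r(t)) = t^2, so dT/dt should equal 2*t.
s = math.sqrt(5.0)

def T(x, y, z):
    return x * x + y * y + z * z

def path(t):
    return (t / s, 2 * t / s, 0.0)

t0, h = 0.7, 1e-6
lhs = (T(*path(t0 + h)) - T(*path(t0))) / h          # d/dt of T(r(t))
x, y, z = path(t0)
rhs = 2 * x / s + 2 * y * 2 / s + 2 * z * 0.0        # grad T . r'(t)
print(lhs, rhs)  # both close to 2 * 0.7 = 1.4
```

Both sides agree to within the accuracy of the finite-difference step $h$.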
Example
If $T(\vr) = x^2+y^2+z^2$, the equipotentials are spheres and $\nabla T = 2\langle x,y,z\rangle = 2\vr$. Thus, $\Delta\vr = -2\varepsilon \vr_0$. This displacement is directed toward the origin, where the temperature has the minimum value $T({\bf 0}) = 0$. See the figure below.
Figure 1. Equipotentials and displacement vector.
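The example can also be checked numerically: repeatedly stepping by $\Delta\vr = -\varepsilon\nabla T$ drives the point to the origin. A minimal Python sketch (starting point and $\varepsilon$ are my arbitrary choices):

```python
# Gradient descent on T(r) = x^2 + y^2 + z^2: repeated steps of -eps * grad T
# walk from an arbitrary start (my choice) straight to the minimum at the origin.
def grad_T(r):
    x, y, z = r
    return (2 * x, 2 * y, 2 * z)

r = (3.0, -1.0, 2.0)
eps = 0.1
for _ in range(200):
    g = grad_T(r)
    r = tuple(ri - eps * gi for ri, gi in zip(r, g))

print(r)  # converges to (0.0, 0.0, 0.0)
```

Each step shrinks every coordinate by the factor $1 - 2\varepsilon$, so the path is a straight line toward the origin, matching the figure.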
One dimension
Equation (1) and its interpretation generalize nicely to higher and lower dimensions. In one dimension we have $$\der[T(x(t))]{t} = \der[T]{x} \der[x]{t}.$$ Clearly this value is maximally negative if $dT/dx$ and $dx/dt$ are opposite in sign. Thus, we let $\Delta x = -\varepsilon dT/dx$.
Example
If $T(x) = x^2$, we find $\Delta x = -2\varepsilon x_0$. This displacement is again directly towards the minimum at $x=0$, independent of the sign of $x_0$.