Intuition behind Lagrange multipliers

I noticed that all attempts at showcasing the intuition behind Lagrange multipliers basically resort to the following example (taken from Wikipedia):

[figure: concentric level curves of $f$ labeled $d_1$, $d_2$, $d_3$, with the constraint curve $g(x, y) = c$ touching one of them]

The reason such examples make sense is that the level curves of $f$ are either only decreasing ($d_1 < d_2 < d_3$) or only increasing ($d_1 > d_2 > d_3$) as we move outward concentrically, so it's obvious that the innermost level curve that still touches the constraint curve gives the minimum/maximum we are looking for.

But in real-life examples, I imagine we can have a function $f$ whose level curves look like this (they don't decrease/increase in an orderly fashion):

[figure: non-concentric level curves $f = 1$, $f = 3$, and $f = 4$; the constraint curve $g$ is tangent to the $f = 3$ curve at a blue point and passes through two green points where $f$ peaks]

In the above example that I thought of, the maximum of $f$ (subject to the constraint $g$) would not be attained at the blue point (where the constraint curve is tangent to a level curve of $f$), but at the two green points.

I haven't seen this kind of level-curve arrangement in any tutorial/lecture/course, and it seems to me that every demonstration conveniently picks the most favorable scenario for presenting the intuition. I must be wrong somewhere, but I can't figure out where.


For my part, I don’t find that way of explaining the Lagrange multiplier method particularly enlightening. Instead, I like to think of it in terms of directional derivatives. Suppose we have $f:\mathbb R^2\to\mathbb R$ and wish to find the critical points of $f$ restricted to the curve $\gamma$. If we “straighten out” $\gamma$, this becomes the familiar problem of finding the critical points of a single-variable function, which occur where its derivative vanishes. The analog of this condition in the original, non-straightened setting is that the directional derivative of $f$ in a direction tangent to $\gamma$ vanishes: loosely, $\nabla f\cdot \gamma'=0$.

Now, the gradient of a function is normal to its level curves, so if $\gamma$ is given as a level curve of a function $g:\mathbb R^2\to\mathbb R$, then this condition is equivalent to the two gradients being parallel, i.e., $\nabla f=\lambda\nabla g$. This idea generalizes to higher dimensions: we have instead a level (hyper)surface of $g$ and want the derivative of $f$ in every direction tangent to it to vanish, which is again equivalent to the gradients of the two functions being parallel. For suitably well-behaved functions $g$, the conceptual “flattening out” of the hypersurface above is justified in a neighborhood of a point by the implicit function theorem.
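To make this concrete, here is a minimal SymPy sketch (the toy objective and constraint are my own choices, not from the question) that solves $\nabla f = \lambda \nabla g$ together with the constraint, for $f(x,y) = x + y$ on the unit circle:

```python
# Solve grad f = lam * grad g together with g = 0 for a toy problem:
# extremize f(x, y) = x + y on the unit circle x^2 + y^2 = 1.
import sympy as sp

x, y, lam = sp.symbols('x y lam', real=True)
f = x + y
g = x**2 + y**2 - 1                      # constraint curve g = 0

# Stationarity in each coordinate, plus the constraint itself.
equations = [sp.diff(f, x) - lam * sp.diff(g, x),
             sp.diff(f, y) - lam * sp.diff(g, y),
             g]
for sol in sp.solve(equations, [x, y, lam], dict=True):
    print(sol, ' f =', f.subs(sol))
# The two solutions (+-1/sqrt(2), +-1/sqrt(2)) are the constrained
# maximum and minimum, with f = +-sqrt(2).
```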


I am imagining your proposed $f(x, y)$ to look like ripples on a pond, with the $f=4$ contour being a wave crest ring, and the $f=1$ contour being a wave trough ring.

The green points are unconditional local maxima for $f$. That is, they are optima even if we don't enforce $g=c$. At those points, the gradient of $f$ is the zero vector. The equation $\nabla f = \lambda \nabla g$ will still hold for $\lambda = 0$, so your example hasn't broken anything. All that happens is that we lose the geometric statement "gradients are parallel" because the zero vector has no direction. This always occurs when an unconditional optimum of $f$ happens to lie exactly on the constraint curve $g=c$.
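Here is the $\lambda = 0$ case in symbols, with toy functions of my own choosing: an unconstrained maximum of $f$ that happens to sit on the constraint curve.

```python
# f = -(x^2 + y^2) peaks at the origin, and the constraint line y = 0
# passes right through that peak, so the multiplier comes out zero.
import sympy as sp

x, y, lam = sp.symbols('x y lam', real=True)
f = -(x**2 + y**2)   # unconstrained maximum at the origin
g = y                # constraint: the line y = 0

equations = [sp.diff(f, x) - lam * sp.diff(g, x),
             sp.diff(f, y) - lam * sp.diff(g, y),
             g]
print(sp.solve(equations, [x, y, lam], dict=True))
# -> [{lam: 0, x: 0, y: 0}]: grad f is the zero vector there,
#    and the Lagrange equation holds with lambda = 0.
```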

The $f=3$ contour is more interesting, because I have not imagined it to necessarily be a crest or trough. It could just be part of a swell. Regardless, the blue point has to be listed as a candidate in the Lagrange multiplier method. The method says that all constrained optima will satisfy $\nabla f = \lambda \nabla g$, but it doesn't say that non-optima can't. It is just a necessary condition.
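A quick illustration (again my own toy example) that the condition is only necessary: for $f = x^3 + y$ on the line $y = 0$, the point $(0,0)$ satisfies $\nabla f = \lambda \nabla g$ yet is neither a constrained max nor a min.

```python
# The Lagrange condition flags candidates, not guaranteed optima.
import sympy as sp

x, y, lam = sp.symbols('x y lam', real=True)
f = x**3 + y         # restricted to y = 0 this is just x^3
g = y                # constraint: the line y = 0

equations = [sp.diff(f, x) - lam * sp.diff(g, x),
             sp.diff(f, y) - lam * sp.diff(g, y),
             g]
print(sp.solve(equations, [x, y, lam], dict=True))
# -> [{lam: 1, x: 0, y: 0}]: (0, 0) is flagged as a candidate,
#    yet x^3 has neither a max nor a min at x = 0 (an inflection).
```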

Anyway, I find the following to be a convincing argument that Lagrange multipliers work. It comes from Wikipedia:

The method of Lagrange multipliers relies on the intuition that at a maximum, $f(x, y)$ cannot be increasing in the direction of any neighboring point where $g = c$. If it were, we could walk along $g = c$ to get higher, meaning that the starting point wasn't actually the maximum.

(...i.e. if the gradient of $f$ had any component along $g=c$, it would be giving us a direction to walk, so optima must have $\nabla f$ orthogonal to $g=c$, or equivalently parallel to $\nabla g$).
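Here is that walk in numbers, a throwaway check with $f(x,y) = x + y$ and the unit circle as $g = c$ (both my own stand-ins): at $(1, 0)$, $\nabla f = (1,1)$ has a component along the circle, and indeed a small step along the constraint raises $f$.

```python
import numpy as np

def f(x, y):
    return x + y

# Parameterize the constraint circle as (cos t, sin t) and take a
# small step along it, starting from (1, 0).
t, step = 0.0, 1e-3
before = f(np.cos(t), np.sin(t))
after = f(np.cos(t + step), np.sin(t + step))
print(after > before)   # True: walking along g = c increases f,
                        # so (1, 0) was not the constrained maximum.
```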


If $f$ is continuous, this kind of arrangement can't happen. To see this, consider a continuous path going from the $f=1$ curve to the $f=4$ curve. Let $(x(t), y(t))$ denote a (continuous) parameterization of such a path with, say, $(x(0), y(0))$ a point on the $f=1$ curve and $(x(1), y(1))$ a point on the $f=4$ curve.

Then we get a single variable continuous function by mapping $t$ to $f(x(t),y(t))$. If we plug in $t=0$, we get $f(x(0), y(0)) = 1$ and likewise, if we plug in $t=1$, we get $f(x(1), y(1)) = 4$. By the intermediate value theorem, there is some $t$ we can plug in for which $f(x(t),y(t)) = 3$.
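If it helps, here is that argument run numerically, with a made-up $f$ and path (both purely for illustration); bisection hunts down a $t$ where $f$ crosses $3$.

```python
def f(x, y):
    return x**2 + y**2          # f = 1 at (1, 0) and f = 4 at (2, 0)

def path(t):                    # a continuous path between those points
    return (1.0 + t, 0.0)

def h(t):                       # the single-variable function above
    return f(*path(t)) - 3.0

lo, hi = 0.0, 1.0               # h(0) = -2 < 0 and h(1) = 1 > 0
for _ in range(60):             # bisection: shrink to the sign change
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if h(mid) < 0 else (lo, mid)
print(lo, f(*path(lo)))         # t ~ 0.73205..., f ~ 3.0
```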

In terms of your picture, what this says is that any continuous curve which starts at the $f=1$ curve and ends at the $f=4$ curve must pass through the $f=3$ curve. If you think about this, you should be able to convince yourself that this forces the curves $f=1$, $f=3$, and $f=4$ to be concentric, as in the pictures you typically see.

Edit: Let me try to address it a little differently so that you can see why the connectedness of the $f=3$ curve is irrelevant. The issue is that no one is claiming that a point where the level curves of $f$ and $g$ are tangent is a max or min of $f$. Rather, the claim is that the only hope of finding a max or min of $f$ is at a point where the level curves of $f$ and $g$ are tangent (for reasonably smooth $f$ and $g$), or where $f$ already has a max or min.

Here is the argument (which is essentially what amd wrote in his/her answer), with a bit more hand-waving. Suppose $(x_0,y_0)$ is a point along the curve $g(x,y) = c$ for which $f(x_0,y_0)$ is a maximum. (I'll write $d = f(x_0,y_0)$ for this value.)

The question is: how are $f$ and $g$ related at $(x_0,y_0)$?

Well, zoom in incredibly close to $(x_0,y_0)$. At this resolution, all you see is a very small part of the curve $g(x,y) = c$ and a very small portion of one connected piece of the curve $f = d$. The point is that while the curve $f=d$ may have infinitely many components, we are zooming in so far as to see only a part of the one containing $(x_0,y_0)$.

Because we are zoomed in very close, for intuition purposes we may as well replace the curves $g = c$ and $f = d$ by their tangent lines. If these tangent lines are different, then the line $g = c$ passes from one side of the line $f = d$ to the other. If $f$ does not have a local min/max at $(x_0,y_0)$, then the values of $f$ must be higher on one side of the $f$-line and lower on the other. Since the line $g = c$ crosses the line $f = d$, by using points on one side or the other of the $f$-line we can find points on $g=c$ which make $f$ even bigger than $d$. This is a problem: $d$ was supposed to be the maximum value!

So something had to go wrong. We have already identified both problems. First, maybe the tangent line to $f = d$ was equal to the tangent line to $g = c$. Then the $g$ line doesn't cross the $f$ line. Second, if $f$ has a local max at $(x_0,y_0)$, then points on $g(x,y) = c$ don't increase the value of $f$ as the $g$ line crosses the $f$ line. (Another thing that could go wrong: maybe $f$ just doesn't have a maximum subject to the constraint. This issue actually can't arise if the curve $g = c$ doesn't go off to infinity in some direction.)
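To see the crossing numerically, here is a throwaway check (the functions are my own toy choices): take $f(x,y) = y$ and the unit circle as $g = c$, at the point $(1,0)$, where the gradients $(0,1)$ and $(2,0)$ are not parallel. Nearby constraint points land on both sides of the level line $f = d$.

```python
# At (1, 0) the tangent lines of f = d and g = c differ, so the
# constraint curve crosses the level curve: points of g = c just to
# either side give values of f above and below d.
import numpy as np

def f(x, y):
    return y

d = f(1.0, 0.0)                         # d = 0 at the point (1, 0)
eps = 1e-3
side_a = f(np.cos(eps), np.sin(eps))    # one direction along g = c
side_b = f(np.cos(-eps), np.sin(-eps))  # the other direction
print(side_a > d, side_b < d)           # True True: f = d is crossed,
                                        # so (1, 0) can't be a
                                        # constrained max of f.
```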