How to prove Lagrange multiplier theorem in a rigorous but intuitive way?

Following some text books, the Lagrange multiplier theorem can be described as follows.

Let $U \subset \mathbb{R}^n$ be an open set and let $f:U\rightarrow \mathbb{R}, g:U\rightarrow \mathbb{R}$ be $C^1$ functions. Let $\mathbf{x}_{0} \in U, c=g(\mathbf{x}_0)$, and let $S$ be the level set of $g$ with value $c$. Suppose that $\nabla g(\mathbf{x}_0) \neq 0$.
If the function $f|_S$ has a local extremum at $\mathbf{x}_0$, then there is a $\lambda \in \mathbb{R}$ such that $\nabla f(\mathbf{x}_0)=\lambda \nabla g(\mathbf{x}_0)$.

Some textbooks give a rigorous proof using the implicit function theorem (see proof). The proof is very clear; however, I can't build much intuition from it. Can someone help me prove this theorem in a rigorous but more intuitive way? Thanks.


Here's some intuition.

First suppose that $x_0$ is a local minimizer for the problem \begin{align} \tag{$\spadesuit$} \text{minimize} & \quad f(x) \\ \text{subject to} &\quad Ax = b. \end{align} If $Au = 0$, then the directional derivative $D_u f(x_0) = \langle \nabla f(x_0),u \rangle$ is nonnegative. Otherwise, it would be possible to improve $x_0$ by perturbing it a bit in the direction $u$.

Applying the same argument to $-u$ (which also satisfies $A(-u) = 0$) gives $\langle \nabla f(x_0),u \rangle \leq 0$ as well. It follows that if $Au = 0$, then $\langle \nabla f(x_0),u \rangle = 0$.

So, $\nabla f(x_0)$ is in the orthogonal complement of the null space of $A$.

BUT, according to the "four subspaces theorem" (which is beloved by Gilbert Strang), the orthogonal complement of the null space of $A$ is the range of $A^T$. Thus, $\nabla f(x_0)$ is in the range of $A^T$. In other words, there exists a vector $\lambda$ such that \begin{equation} \nabla f(x_0) = A^T \lambda. \end{equation} This is our Lagrange multiplier optimality condition, in the case where we have linear equality constraints.
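If you want to see this condition concretely, here is a minimal numerical sketch (assuming NumPy; the quadratic objective and the data $A$, $b$, $c$ are made up for illustration). It solves $\min \tfrac12\|x-c\|^2$ subject to $Ax=b$ by forming the linear system $x - c = A^T\lambda$, $Ax = b$, and then checks that $\nabla f(x_0)$ indeed lies in the range of $A^T$.

```python
import numpy as np

# Hypothetical problem data: minimize f(x) = 0.5*||x - c||^2 subject to A x = b.
rng = np.random.default_rng(0)
A = rng.standard_normal((2, 5))   # 2 linear equality constraints in R^5
b = rng.standard_normal(2)
c = rng.standard_normal(5)
m, n = A.shape

# KKT system:  x - c = A^T lam  and  A x = b.
K = np.block([[np.eye(n), -A.T],
              [A, np.zeros((m, m))]])
rhs = np.concatenate([c, b])
sol = np.linalg.solve(K, rhs)
x0, lam = sol[:n], sol[n:]

grad_f = x0 - c                        # gradient of f at x0
print(np.allclose(A @ x0, b))          # x0 is feasible
print(np.allclose(grad_f, A.T @ lam))  # gradient lies in the range of A^T
```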

Next, suppose that $x_0$ is a local minimizer for the problem \begin{align} \text{minimize} & \quad f(x) \\ \text{subject to} &\quad g(x) = 0, \end{align} where $g:\mathbb R^n \to \mathbb R^m$.

It seems plausible that $x_0$ is also a local minimizer for the problem we get by replacing $g$ with the local linear approximation to $g$ at $x_0$: \begin{align} \text{minimize} & \quad f(x) \\ \text{subject to} &\quad g(x_0) + g'(x_0)(x - x_0) = 0. \end{align} (Here $g'(x_0)$ is the Jacobian matrix of $g$ at $x_0$.) Assuming that $x_0$ is indeed a local minimizer for this modified problem, it follows that $x_0$ satisfies the optimality condition derived above: \begin{align} \nabla f(x_0) &= g'(x_0)^T \lambda \\ &= \nabla g_1(x_0) \lambda_1 + \cdots + \nabla g_m(x_0) \lambda_m \end{align} for some vector $\lambda = \begin{bmatrix} \lambda_1 & \ldots & \lambda_m \end{bmatrix}^T \in \mathbb R^m$. (Here $g_1,\ldots,g_m$ are the component functions of $g$, and we use the convention that the gradient is a column vector.) This is our Lagrange multiplier optimality condition in the case of nonlinear equality constraints.
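A quick numerical sanity check of this condition (a sketch assuming SciPy; the particular $f$ and $g$ are made up): minimize $f(x,y) = x + 2y$ on the circle $g(x,y) = x^2+y^2-1 = 0$, then recover $\lambda$ from $\nabla f(x_0) = g'(x_0)^T\lambda$ by least squares.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical example: minimize f(x,y) = x + 2y on the circle g(x,y) = x^2 + y^2 - 1 = 0.
f = lambda v: v[0] + 2 * v[1]
grad_f = lambda v: np.array([1.0, 2.0])
g = lambda v: v[0] ** 2 + v[1] ** 2 - 1.0
jac_g = lambda v: np.array([[2 * v[0], 2 * v[1]]])   # 1 x 2 Jacobian g'(v)

res = minimize(f, x0=[0.0, -1.0], jac=grad_f, method="SLSQP",
               constraints=[{"type": "eq", "fun": g, "jac": lambda v: jac_g(v)[0]}])
x_star = res.x                         # should be close to -(1, 2)/sqrt(5)

# Recover lambda from grad f(x*) = g'(x*)^T lambda by least squares.
lam, *_ = np.linalg.lstsq(jac_g(x_star).T, grad_f(x_star), rcond=None)
print(x_star, lam)
print(np.allclose(grad_f(x_star), jac_g(x_star).T @ lam, atol=1e-4))  # up to solver tolerance
```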

I believe it's possible to view the proof using the implicit function theorem as a rigorous version of this intuition.


Edit: Now I'll show how a similar approach can handle inequality constraints, if we replace the "four subspaces theorem" with Farkas's lemma (which I call the four cones theorem).

Let $x^*$ be a local minimizer for the problem \begin{align} \tag{$\heartsuit$} \text{minimize} & \quad f(x) \\ \text{subject to} & \quad Ax \leq b. \end{align} Suppose that $u \in \mathbb R^n$ and $Au \leq 0$. Then the directional derivative $D_uf(x^*) = \langle \nabla f(x^*),u \rangle \geq 0$. Otherwise, we could improve $x^*$ by perturbing it a bit in the direction of $u$.

At this point, we would like to invoke something like the four subspaces theorem, to find some optimality condition that $\nabla f(x^*)$ must satisfy. While the four subspaces theorem is inapplicable here, there is a one-sided version of the four subspaces theorem that fits this situation perfectly.

The four cones theorem

We first recall a few facts from convex analysis. If $C \subset \mathbb R^n$ is a cone, then the polar of $C$ is defined by \begin{equation} C^\circ = \{z \mid \langle x,z \rangle \leq 0 \,\, \forall \,\, x \in C \}. \end{equation} Note that $C^\circ$ is a closed convex cone.
A convex cone is a "one-sided" version of a subspace, and the polar cone is a "one-sided" version of the orthogonal complement. Instead of $\langle x,z \rangle = 0$, we have $\langle x, z \rangle \leq 0$. Here's some support for this analogy:

  • For a subspace $W$ of $\mathbb R^n$, we have the familiar equation $W = W^{\perp \perp}$. The one-sided version of this is that if $C$ is a closed convex cone, then $C = C^{\circ \circ}$.
  • If $W$ is a subspace of $\mathbb R^n$, then any $x \in \mathbb R^n$ can be decomposed as $x = \Pi_W(x) + \Pi_{W^\perp}(x)$. The one-sided version of this is that if $C \subset \mathbb R^n$ is a closed convex cone, then any $x \in \mathbb R^n$ can be decomposed as $x = \Pi_C(x) + \Pi_{C^\circ}(x)$. Moreover, $\Pi_C(x)$ is orthogonal to $\Pi_{C^\circ}(x)$.

Let $\mathcal R_+(A^T) = \{ A^T z \mid z \geq 0 \}$. So $\mathcal R_+(A^T)$ is a "one-sided version" of the column space of $A^T$. I call it the column cone of $A^T$. Let $ \mathcal N_-(A) = \{x \mid Ax \leq 0 \}$. So $\mathcal N_-(A)$ is a one-sided version of the null space of $A$. I call it the null cone of $A$. Note that $\mathcal N_-(A)$ and $\mathcal R_+(A^T)$ are both closed convex cones.

The four subspaces theorem can be stated as \begin{equation} \mathcal N(A)^\perp = \mathcal R(A^T). \end{equation} Here is our "one-sided" version of the four subspaces theorem: \begin{equation} \mathcal N_-(A)^\circ = \mathcal R_+(A^T). \end{equation} I call this the "four cones theorem".
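Here is a small numerical illustration of the four cones theorem (a sketch assuming SciPy; it illustrates rather than proves the statement, and the matrix $A$ and point $x$ are random). Projecting onto the column cone $\mathcal R_+(A^T)$ is a nonnegative least-squares problem; by the theorem in its polar form, $\mathcal N_-(A) = \mathcal R_+(A^T)^\circ$, so the Moreau residual should land in the null cone and be orthogonal to the projection.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 6))     # arbitrary 3 x 6 matrix
x = rng.standard_normal(6)          # arbitrary point to decompose

# Project x onto the column cone R_+(A^T) = { A^T z : z >= 0 }:
# min_{z >= 0} || A^T z - x ||  is a nonnegative least-squares problem.
z, _ = nnls(A.T, x)
p = A.T @ z                         # projection onto R_+(A^T)
r = x - p                           # Moreau residual

print(np.all(A @ r <= 1e-8))        # r lies in the null cone N_-(A)
print(abs(np.dot(p, r)) < 1e-8)     # projection and residual are orthogonal
```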

Invoking the four cones theorem

We can use the "four cones theorem" to find an optimality condition for problem ($\heartsuit$), just as we used the four subspaces theorem to find an optimality condition for problem ($\spadesuit$). As noted previously, if $A u \leq 0$, then $\langle \nabla f(x^*), u \rangle \geq 0$. In other words, $-\nabla f(x^*) \in \mathcal N_-(A)^\circ$. By the four cones theorem, it follows that $-\nabla f(x^*) \in \mathcal R_+(A^T)$. Thus, there exists a vector $z \geq 0$ such that \begin{equation} \nabla f(x^*) + A^T z = 0. \end{equation} This, together with $Ax^* \leq b$, is our optimality condition (KKT condition) for the problem ($\heartsuit$) with linear inequality constraints.
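Here is a minimal numerical sketch of this condition (assuming SciPy; the objective and the data are invented): solve a small instance of ($\heartsuit$) with a generic solver, then recover a multiplier $z \geq 0$ from $A^T z = -\nabla f(x^*)$ by nonnegative least squares.

```python
import numpy as np
from scipy.optimize import minimize, nnls

# Hypothetical problem: minimize f(x) = 0.5*||x - c||^2  subject to  A x <= b.
c = np.array([2.0, 2.0])
A = np.array([[1.0, 1.0],
              [1.0, 0.0]])
b = np.array([1.0, 1.0])

f = lambda x: 0.5 * np.sum((x - c) ** 2)
grad_f = lambda x: x - c

res = minimize(f, x0=np.zeros(2), jac=grad_f, method="SLSQP",
               constraints=[{"type": "ineq", "fun": lambda x: b - A @ x,
                             "jac": lambda x: -A}])
x_star = res.x

# Recover multipliers z >= 0 from  grad f(x*) + A^T z = 0,  i.e.  A^T z = -grad f(x*).
z, resid = nnls(A.T, -grad_f(x_star))
print(x_star, z)    # x* approx (0.5, 0.5); z approx (1.5, 0): only the first constraint is active
print(resid < 1e-4)  # stationarity holds up to solver tolerance
```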

Connection with Farkas's lemma

The four cones theorem can be stated as a "theorem of the alternative". If $g \in \mathbb R^n$, then $g$ either does or does not belong to $\mathcal R_+(A^T)$. In other words (using the four cones theorem), either $g$ belongs to $\mathcal R_+(A^T)$, or else $g$ does not belong to the polar of $\mathcal N_-(A)$. This means that one and only one of the following two alternatives is true:

  1. There exists $z \geq 0$ such that $g = A^T z$.
  2. There exists $u$ such that $Au \leq 0$ and $\langle g, u \rangle > 0$.

This is equivalent to the statement of Farkas's lemma given in Boyd and Vandenberghe. The four cones theorem is Farkas's lemma.
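As a small computational companion (a sketch assuming SciPy; `farkas_alternative` and its inputs are hypothetical names), one can decide which alternative holds: test alternative 1 with nonnegative least squares, and if it fails, produce a certificate $u$ for alternative 2 with a bounded LP. The box constraint on $u$ only keeps the LP finite, since any valid $u$ can be rescaled.

```python
import numpy as np
from scipy.optimize import nnls, linprog

def farkas_alternative(A, g, tol=1e-9):
    """Return ('1', z) if g = A^T z for some z >= 0; otherwise ('2', u) with
    A u <= 0 and <g, u> > 0 (up to numerical tolerance)."""
    z, resid = nnls(A.T, g)
    if resid < tol:
        return "1", z
    # Certificate for alternative 2: maximize <g, u> subject to A u <= 0, -1 <= u <= 1.
    res = linprog(c=-g, A_ub=A, b_ub=np.zeros(A.shape[0]), bounds=[(-1, 1)] * len(g))
    return "2", res.x

A = np.array([[1.0, 0.0], [0.0, 1.0]])
print(farkas_alternative(A, np.array([2.0, 3.0])))   # alternative 1: g = A^T z with z = (2, 3) >= 0
print(farkas_alternative(A, np.array([-1.0, 2.0])))  # alternative 2: u = (-1, 0) has A u <= 0, <g,u> > 0
```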

Visualizing the four cones theorem

The "alternative" statement of the four cones theorem suggests a way of visualizing the four cones theorem that makes it seem obvious. The first alternative is that $g$ belongs to the convex cone generated by the columns of $A^T$. The second alternative is that there is a hyperplane through the origin that separates $g$ from the columns of $A^T$. (The vector $u$ is a normal vector to this separating hyperplane.)

Problem with linear equality and inequality constraints

This case is similar to the previous cases, except that we will use a different version of Farkas's lemma that involves a combination of equality and inequality constraints.


Look at the tangent space to $S = g^{-1}(c)$ at $x_0$, say $S_{x_0}$. Then $S_{x_0} = \left[\nabla g(x_0) \right]^{\perp}$. Therefore, $S_{x_0}^\perp$ is the one-dimensional subspace spanned by $\nabla g(x_0)$. Now, $$\nabla f(x_0) = \lambda \nabla g(x_0)\text{ for some } \lambda \in \Bbb{R} \iff \nabla f(x_0) \in S_{x_0}^\perp \iff \nabla f(x_0)\cdot v = 0 \text{ for all } v \in S_{x_0}.$$ But every $v \in S_{x_0}$ is of the form $v = \dot\alpha(t_0)$ for some parametrized curve $\alpha:I \to S$ and $t_0 \in I$ with $\alpha(t_0) =x_0$. Since $x_0$ is an extreme point of $f$ on $S$, $t_0$ is an extreme point of $f \circ \alpha$ on $I$. Hence $$0=(f \circ \alpha)'(t_0) = \nabla f(\alpha(t_0))\cdot\dot\alpha(t_0) = \nabla f(x_0)\cdot v$$ for all $v \in S_{x_0}$, and so the result follows.

Caveat: Think about where the implicit function theorem comes into this proof.
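To make the argument concrete, here is a quick numerical check of a hypothetical example (not from the original answer): take $g(x,y) = x^2+y^2$, $c = 1$, and $f(x,y) = x+y$, which is extremized on the circle at $x_0 = (1/\sqrt2, 1/\sqrt2)$, with the curve $\alpha(t) = (\cos t, \sin t)$ and $t_0 = \pi/4$. Assuming NumPy:

```python
import numpy as np

# Hypothetical concrete case: f(x, y) = x + y on S = {x^2 + y^2 = 1},
# extremized at x0 = (1/sqrt(2), 1/sqrt(2)), with alpha(t) = (cos t, sin t).
t0 = np.pi / 4
x0 = np.array([np.cos(t0), np.sin(t0)])
grad_f = np.array([1.0, 1.0])
grad_g = 2 * x0                                  # gradient of g(x, y) = x^2 + y^2

alpha_dot = np.array([-np.sin(t0), np.cos(t0)])  # velocity of the curve at t0, tangent to S at x0
print(np.isclose(grad_f @ alpha_dot, 0.0))       # (f o alpha)'(t0) = grad f(x0) . alpha'(t0) = 0
lam = grad_f[0] / grad_g[0]
print(np.allclose(grad_f, lam * grad_g))         # grad f(x0) = lambda * grad g(x0), lambda = 1/sqrt(2)
```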


Here's a different way to see that Lagrange multipliers are "intuitively" the correct way to go. To first order, the change in $f$ due to a change in $\vec r$ is simply:$$\delta f = f(\vec r + \delta\vec r) - f(\vec r) \approx \nabla f\cdot\delta\vec r.$$This has the obvious interpretation for $f$ above, "go in the direction of $\nabla f$ to increase $f$," but it also has an interpretation for constraints: to first order, the paths allowed by a constraint $g(\vec r) = c$ must be the ones such that $\nabla g \cdot \delta \vec r = 0$; otherwise, following that tangent vector would change $g$ away from $c$ to first order. That is, if you obey the constraint, we must have $\delta g = 0.$
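A quick finite-difference check of these two first-order statements (a minimal sketch assuming NumPy; the particular $f$, $g$, and point are invented):

```python
import numpy as np

# Hypothetical functions: f(x, y) = x * y,  g(x, y) = x^2 + y^2 (constraint g = c).
f = lambda r: r[0] * r[1]
g = lambda r: r[0] ** 2 + r[1] ** 2
grad_f = lambda r: np.array([r[1], r[0]])
grad_g = lambda r: 2 * r

r = np.array([1.0, 2.0])
eps = 1e-6
# A small step tangent to the constraint: rotate grad g by 90 degrees, so grad g . dr = 0.
n = grad_g(r)
dr = eps * np.array([-n[1], n[0]]) / np.linalg.norm(n)

print(g(r + dr) - g(r))                  # O(eps^2): the constraint is respected to first order
print(f(r + dr) - f(r), grad_f(r) @ dr)  # the two agree to first order
```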

Now we simply combine these two. At the constrained extremum of $f$ on $S,$ it must be the case that for all $\delta\vec r$ such that $\nabla g \cdot \delta \vec r = 0,$ we also have $\delta f = \nabla f\cdot\delta\vec r = 0,$ otherwise we can follow $\pm \delta\vec r$ to increase/decrease $f$, proving that this wasn't either sort of extremum on $S$ after all. So just to phrase the very condition for a constrained extremum, we would have to write it as $$\forall \vec a . (\nabla g \cdot \vec a = 0) \rightarrow (\nabla f \cdot \vec a = 0).$$

But that "for-all" is 100% sufficient to prove that they are proportional! Effectively, you can't have two linearly independent vectors $\vec a, \vec b$ such that the nullspaces of $(\vec a \cdot)$ and $(\vec b \cdot)$ are the same. Let's just prove this last aspect to complete the proof.

Consider $\vec p = \vec a - \alpha \vec b$ with $\alpha = \vec a \cdot \vec b / (\vec b \cdot \vec b)$. By construction it satisfies $\vec b \cdot \vec p = 0$, and therefore we must have $\vec a \cdot \vec p = 0$, since the nullspace of $(\vec b \cdot)$ is contained in that of $(\vec a \cdot)$. But when you chug through the algebra, this ultimately means that $(\vec a \cdot \vec a)(\vec b \cdot \vec b) = (\vec a \cdot \vec b)^2.$ That guarantees the $=$ sign in a triangle inequality looking like either of $$|\vec a \pm \vec b|^2 = |\vec a|^2 + |\vec b|^2 \pm 2 (\vec a \cdot \vec b) ~\le~ (|\vec a| + |\vec b|)^2 = |\vec a|^2 + |\vec b|^2 + 2 |\vec a||\vec b|.$$ We know that those equals signs only hold when the vectors are parallel, that is, $\vec a = \lambda \vec b.$
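For completeness, chugging through that algebra (with $\alpha = \vec a \cdot \vec b/(\vec b\cdot\vec b)$ as above) looks like $$0 = \vec a \cdot \vec p = \vec a \cdot \vec a - \frac{\vec a \cdot \vec b}{\vec b \cdot \vec b}\,(\vec a \cdot \vec b) \quad\Longrightarrow\quad (\vec a \cdot \vec a)(\vec b \cdot \vec b) = (\vec a \cdot \vec b)^2.$$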