Understanding the concept behind the Lagrangian multiplier

I'll explain it in a three-dimensional setting. This means a priori you are given a "temperature" function $f:\,{\mathbb R}^3\to{\mathbb R}$.

Consider a fixed point ${\bf p}\in{\mathbb R}^3$ and a point $t\mapsto{\bf x}(t)$ $\ (t\geq0)$ moving away from ${\bf p}$ in direction ${\bf u}\,$; so ${\bf x}(0)={\bf p}$, $\,{\bf x}'(0)={\bf u}$. The temperature at the moving point is a function of $t$ and is given by $\phi(t):=f\bigl({\bf x}(t)\bigr)$. According to the chain rule we have $$\phi'(0)=\nabla f\bigl({\bf x}(0)\bigr)\cdot{\bf x}'(0)=\nabla f({\bf p})\cdot{\bf u}\ .\qquad(1)$$ If $\nabla f({\bf p})\ne{\bf 0}$ then there will always be direction vectors ${\bf u}$ such that the scalar product on the right is positive and other such vectors where it is negative. In this case the temperature $f$ can neither be locally minimal nor locally maximal at ${\bf p}$. In particular, for $|{\bf u}|=1$ the increase $\phi'(0)$ is maximal ($=|\nabla f({\bf p})|$) if ${\bf u}$ points in the direction of $\nabla f({\bf p})$, the temperature remains stationary, i.e., $\phi'(0)=0$, for directions ${\bf u}\perp\nabla f({\bf p})$, and decreases fastest in the direction $-\nabla f({\bf p})$.
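The chain-rule identity $(1)$ is easy to check numerically. Here is a minimal sketch with an arbitrary temperature function of my own choosing (nothing here is from the text above): it compares a central finite difference of $\phi$ at $t=0$ against $\nabla f({\bf p})\cdot{\bf u}$.

```python
import numpy as np

# Illustrative "temperature" function and its gradient (my own choice)
def f(x):
    return x[0]**2 + 3*x[1]*x[2]

def grad_f(x):
    return np.array([2*x[0], 3*x[2], 3*x[1]])

p = np.array([1.0, 2.0, -1.0])                 # fixed point p
u = np.array([1.0, 1.0, 1.0]) / np.sqrt(3.0)   # unit direction u

# phi(t) = f(x(t)) along the straight path x(t) = p + t*u
phi = lambda t: f(p + t*u)

h = 1e-6
fd = (phi(h) - phi(-h)) / (2*h)   # central difference for phi'(0)
exact = grad_f(p) @ u             # chain rule, equation (1)

print(fd, exact)  # the two values agree up to finite-difference error
```

Replacing `u` by `grad_f(p) / np.linalg.norm(grad_f(p))` reproduces the maximal increase $|\nabla f({\bf p})|$, and any `u` orthogonal to the gradient gives $\phi'(0)=0$, as described above.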

When the moving point stays on the isothermal surface $S_f$ through ${\bf p}$ then $\phi'(0)=0$. Looking at $(1)$ we see that all "$f$-isothermal tangent directions" are orthogonal to $\nabla f({\bf p})$. This implies that $\nabla f({\bf p})$ spans the orthogonal complement of the tangent plane $T_{f,{\bf p}}$ of $S_f$.

Assume now that in addition to $f$ we have a constraint $g({\bf x})=0$ defining an "admissible" surface $S_g\subset{\mathbb R}^3$, and assume that our point ${\bf p}$ satisfies the constraint. In this case only "$g$-isothermal tangent directions" ${\bf u}$ are allowed for the moving point.

If ${\bf p}$ aspires to be a conditionally extremal point, whence a conditionally stationary point, of $f$ then we should have $\phi'(0)=0$ for all allowed directions. This means by $(1)$ that $\nabla f({\bf p})$ should be orthogonal to all "$g$-isothermal tangent directions", or that $\nabla f({\bf p})$ is in the orthogonal complement of the tangent plane $T_{g,{\bf p}}$. As the latter is spanned by $\nabla g({\bf p})$ (assumed $\ne{\bf 0}$) we would have an equality of the form $\nabla f({\bf p})=\lambda\, \nabla g({\bf p})$ for some $\lambda\in{\mathbb R}$. This in turn implies that the point ${\bf p}$ will come to the fore when we solve the equations $$\nabla f({\bf x})=\lambda\, \nabla g({\bf x}),\quad g({\bf x})=0\qquad(2)$$ for ${\bf x}$ (and $\lambda$).

Applying the "Lagrange method" means no more and no less than solving the equations $(2)$.
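As a concrete illustration of solving $(2)$, here is a sketch with a standard textbook choice of $f$ and $g$ (my own example, not from the text): extremize $f(x,y)=x+y$ on the unit circle $g(x,y)=x^2+y^2-1=0$.

```python
import sympy as sp

x, y, lam = sp.symbols('x y lam', real=True)

# Illustrative choice: f = x + y on the circle g = 0
f = x + y
g = x**2 + y**2 - 1

# Equations (2): grad f = lam * grad g, together with the constraint g = 0
eqs = [sp.diff(f, x) - lam*sp.diff(g, x),
       sp.diff(f, y) - lam*sp.diff(g, y),
       g]

sols = sp.solve(eqs, [x, y, lam], dict=True)
for s in sols:
    print(s, '  f =', f.subs(s))
# Two conditionally stationary points: (±√2/2, ±√2/2), with f = ±√2
```

The two solutions are exactly the conditionally maximal and minimal points of $f$ on the circle, with $\lambda=\pm\tfrac{\sqrt2}{2}$ scaling $\nabla g$ to match $\nabla f$.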


The most intuitive way I've found to understand it is to think of the Lagrangian formulation as a computationally more tractable way of applying the implicit function theorem. It requires more mental effort than just pulling a Lagrangian out of thin air and chugging through proofs, but it gets to the same place, and ultimately I think it provides better motivation.

The basic idea is this: you eliminate the constraint by composing the cost functional with the solution operator of the constraint equation. Then you can take derivatives and find extrema of the composed system using the implicit function theorem. The system of equations you have to solve based on a naive application of the implicit function theorem is wasteful, and a smart simplification of that system yields the Lagrange multipliers in a natural way.

Suppose you have

  • a cost functional $F:X \rightarrow \mathbb{R}$,
  • a parameterized constraint function $G:Q \times X \rightarrow X$, ($q \in Q$ is the parameter)
  • a solution operator $S:Q \rightarrow X$ such that $G(q,S(q))=b$ for any given $q\in Q$,

and you want to find the sensitivity of $F$ with respect to the parameter $q$, going through the solution operator: $$\frac{d}{dq} F(S(q)).$$

In your case, the spaces and operators are

  • $X=\mathbb{R}^2$, $Q=\mathbb{R}^1$,
  • $F(x_1,x_2)=f(x_1,x_2)$,
  • $G(q,(x_1,x_2))=(g(x_1,x_2), q-x_1)^T$
  • $b=(c,0)^T$.

The implicit function theorem tells us how to compute this total derivative in terms of partial derivatives of things we know: $$\frac{d}{dq} F(S(q)) \cdot \delta q = -F'_x(S(q)) \circ [G'_x (q,S(q))]^{-1} \circ G'_q (q,S(q)) \cdot \delta q.$$ (The minus sign comes from differentiating the identity $G(q,S(q))=b$ with respect to $q$, which gives $S'(q) = -[G'_x]^{-1} G'_q$.)
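This formula can be checked on a toy instance where the solution operator is known explicitly. The following sketch uses my own illustrative choice $F(x)=x_1^2+x_2^2$, $G(q,x)=(x_1+x_2,\; q-x_1)$, $b=(c,0)$, for which $S(q)=(q,\,c-q)$, and compares the implicit derivative against a finite difference of $F(S(q))$:

```python
import numpy as np

# Toy instance (illustrative, not from the text):
#   F(x) = x1^2 + x2^2,  G(q, x) = (x1 + x2, q - x1),  b = (c, 0)
# so the solution operator is S(q) = (q, c - q), known in closed form here.
c = 1.0

def S(q):
    return np.array([q, c - q])

def F(x):
    return x[0]**2 + x[1]**2

def total_derivative(q):
    x = S(q)
    Fx = np.array([2*x[0], 2*x[1]])   # F'_x, a 1-by-2 row vector
    Gx = np.array([[1.0, 1.0],        # G'_x, a 2-by-2 matrix
                   [-1.0, 0.0]])
    Gq = np.array([0.0, 1.0])         # G'_q
    # dF/dq = -F'_x o [G'_x]^{-1} o G'_q  (minus from implicit differentiation)
    return -Fx @ np.linalg.solve(Gx, Gq)

q, h = 0.3, 1e-6
fd = (F(S(q + h)) - F(S(q - h))) / (2*h)   # finite-difference reference
print(total_derivative(q), fd)             # analytically 4q - 2c = -0.8
```

Note that `np.linalg.solve` is used instead of forming the inverse explicitly; the next paragraphs explain why even this can be streamlined further.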

This is theoretically all you need to find the critical points for a smooth optimization problem, but there is a major issue: to use it you need to compute the matrix inverse $[G^{'}_x (q,S(q))]^{-1}$, which may be very difficult!

On the other hand, computing the full inverse is wasteful since $[G'_x ]^{-1}:X \rightarrow X$ (an n-by-n matrix), whereas the operator on its left is $F'_x : X \rightarrow \mathbb{R}$ (a 1-by-n vector). We can safely ignore the action of $[G'_x ]^{-1}$ in any direction that lies in the kernel of $F'_x$, since vectors in that kernel get sent to zero anyway.

Thus it is natural to look for a Riesz representative $\lambda$ of the combined operator $F'_x \circ [G'_x]^{-1}: X \rightarrow \mathbb{R}$, i.e., a vector $\lambda$ such that $$F'_x \circ [G'_x]^{-1} \cdot x = (\lambda,x) \quad \forall x \in X.$$ Turning the equation around, we want the $\lambda$ that solves $$[G'_x]^T \lambda = [F'_x]^T,$$ the familiar Lagrange multiplier equation (note the transpose: $\lambda$ solves the adjoint system). We have to solve an n-by-n system once, for a single right-hand side, rather than fully inverting the matrix for every possible right-hand side. Then evaluating the implicit derivatives becomes much simpler, with the familiar formula $$\frac{d}{dq} F(S(q)) \cdot \delta q = -\bigl(\lambda,\; G'_q (q,S(q)) \cdot \delta q\bigr),$$ where the minus sign is inherited from the implicit differentiation of the constraint.
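The multiplier route can be verified on the same kind of toy problem (again my own illustrative choice of $F$, $G$, and $b$, with $S(q)=(q,\,c-q)$, hence $dF/dq = 4q-2c$ analytically): one adjoint solve for $\lambda$, then the derivative is read off from a single inner product.

```python
import numpy as np

# Illustrative toy problem:
#   F(x) = x1^2 + x2^2,  G(q, x) = (x1 + x2, q - x1),  b = (c, 0),
#   S(q) = (q, c - q), hence dF/dq = 4q - 2c analytically.
c, q = 1.0, 0.3
x = np.array([q, c - q])           # S(q)

Fx = np.array([2*x[0], 2*x[1]])    # F'_x
Gx = np.array([[1.0, 1.0],
               [-1.0, 0.0]])       # G'_x
Gq = np.array([0.0, 1.0])          # G'_q

# One adjoint solve: [G'_x]^T lam = [F'_x]^T  -- the multiplier equation
lam = np.linalg.solve(Gx.T, Fx)

dFdq = -lam @ Gq                   # derivative via -(lam, G'_q)
print(dFdq)                        # analytically 4q - 2c = -0.8
```

The payoff shows up when there are many parameters: $G'_q \cdot \delta q$ changes with each direction $\delta q$, but the adjoint solve for $\lambda$ is done only once.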

Thus the "Lagrangian" can be seen as a convenient shorthand for the object whose gradient gives you the correct equations for computing implicit derivatives efficiently.

(As an aside, there is also a completely different motivation for the Lagrangian coming from game theory; hopefully someone else will post more about it.)