What Is the Motivation of Proximal Mapping / Proximal Operator?
For a convex function $h$, its proximal operator is defined as: $$\operatorname{prox}_h(x)=\arg\min_u \Big(h(u)+\frac{1}{2}\|u-x\|^2\Big)$$ Can anyone provide an intuitive explanation/motivation of proximal mapping?
Solution 1:
Here are two interpretations which can lend useful intuition in some circumstances:
Prox is a generalized projection
Consider the indicator of some nonempty convex set $X$: $$i_X(x) = \begin{cases}0 & \text{if } x \in X \\ \infty & \text{if } x \not \in X\end{cases}.$$ Then we find that the proximal mapping is the projection onto $X$: $$\begin{align}\text{prox}_{\alpha i_X}(y) &= \underset{x}{\text{argmin}}(\alpha i_X(x)+\frac{1}{2}\|x - y\|^2) \\ &= \underset{x \in X}{\text{argmin}}(\frac{1}{2}\|x - y\|^2) \\ &= \text{proj}_X(y).\end{align}$$
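To make this concrete, here is a minimal numerical sketch, assuming the convex set $X$ is a box $[-1, 1]^n$ (an arbitrary illustrative choice), for which the Euclidean projection is coordinatewise clipping:

```python
import numpy as np

def prox_indicator_box(y, lo=-1.0, hi=1.0):
    """prox of the indicator of the box [lo, hi]^n, i.e. the projection:
    argmin over x in X of (1/2)||x - y||^2 = coordinatewise clipping."""
    return np.clip(y, lo, hi)

y = np.array([2.0, -3.0, 0.5])
print(prox_indicator_box(y))  # [ 1.  -1.   0.5]
```

Coordinates already inside the box are left untouched, exactly as a projection should behave.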
Prox is backward Euler
Consider the backward (implicit) Euler update: $x^{k+1} = x^k - \alpha \nabla f(x^{k+1})$ for some differentiable convex $f$. This update can be viewed as a proximal point update $x^{k+1} = \text{prox}_{\alpha f}(x^k)$: the optimality condition for the prox minimization is $0 = \nabla (\alpha f(x^\star) + \frac{1}{2}\|x^\star - x^k\|^2) = \alpha \nabla f(x^\star) + x^\star - x^k$, and setting $x^\star = x^{k+1}$ recovers exactly the backward Euler equation.
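A quick sketch of this correspondence, assuming the illustrative choice $f(x) = x^2/2$: here $\nabla f(x) = x$, so the implicit equation $x^{k+1} = x^k - \alpha x^{k+1}$ can be solved in closed form, giving $x^{k+1} = x^k / (1 + \alpha)$, which is precisely $\text{prox}_{\alpha f}(x^k)$:

```python
# Backward Euler = proximal point for f(x) = x^2/2 (nabla f(x) = x).
# Solving x_next = x - alpha * x_next gives x_next = x / (1 + alpha).
alpha = 0.5
x = 4.0
for _ in range(20):
    x = x / (1.0 + alpha)  # prox_{alpha f}(x) = implicit Euler step
print(x)  # shrinks geometrically toward the minimizer x* = 0
```

Note that the iteration is stable for any $\alpha > 0$, a hallmark of implicit schemes.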
Solution 2:
If $h$ is a closed convex function, then the proximal operator of $h$ (with parameter $t > 0$) is defined by $$ \text{prox}_{th}(\hat x) = \arg \min_x \, h(x) + \frac{1}{2t} \|x - \hat x \|_2^2. $$ When we evaluate the prox-operator of $h$, we are attempting to reduce the value of $h$, but we are penalized for straying too far from $\hat x$. (The parameter $t$ is like a "step size" that controls how much we are penalized for moving away from $\hat x$.)
You can see that evaluating a prox-operator might be a useful sub-step in an optimization algorithm. For example, suppose that we want to solve the optimization problem $$ \text{minimize} \quad g(x) + h(x) $$ where $g$ and $h$ are convex, $g$ is differentiable, and $h$ is "simple" in the sense that its prox-operator can be evaluated efficiently. (Importantly, $h$ is not required to be differentiable.) A natural strategy is to first reduce the value of $g$ by taking a step in the negative gradient direction, then reduce the value of $h$ by applying the prox-operator of $h$, and repeat. This strategy yields the following iteration: $$ x^{k+1} = \text{prox}_{th}(x^k - t \nabla g(x^k)). $$ (This iteration is known as the proximal gradient method. You could implement it right now in Python to solve famous optimization problems such as the Lasso problem from statistics.)
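Here is a minimal sketch of that iteration on a small Lasso problem, $\min_x \frac{1}{2}\|Ax - b\|_2^2 + \lambda \|x\|_1$, where $g$ is the smooth least-squares term and the prox of $t\lambda\|\cdot\|_1$ is coordinatewise soft-thresholding. The problem data below are synthetic, chosen only for illustration:

```python
import numpy as np

def soft_threshold(v, tau):
    """prox of tau*||.||_1, applied coordinatewise."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
x_true = np.array([1.0, 0.0, -2.0, 0.0, 0.0])  # sparse ground truth
b = A @ x_true
lam = 0.1
t = 1.0 / np.linalg.norm(A, 2) ** 2  # step size 1/L, L = ||A||_2^2

x = np.zeros(5)
for _ in range(500):
    grad = A.T @ (A @ x - b)                  # gradient step on g
    x = soft_threshold(x - t * grad, t * lam)  # prox step on h
print(np.round(x, 2))  # approximately x_true, shrunk slightly by the l1 term
```

This is exactly the ISTA algorithm; the gradient step decreases the smooth term while the prox step enforces sparsity.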
Here is an alternative way to derive the proximal gradient method and, in the process, discover the proximal operator. Suppose again that we want to minimize $g(x) + h(x)$, with $g$ and $h$ as given above. A natural strategy is to replace $g$ with a local linear approximation to $g$ (this is the fundamental strategy of calculus), and also to add a penalty term to the objective function which discourages us from straying too far from the current iterate (this ensures that the local linear approximation to $g$ remains accurate). This strategy yields the following iteration: $$ x^{k+1} = \arg \min_x \quad h(x) + \underbrace{g(x^k) + \nabla g(x^k)^T(x - x^k)}_{\approx \, g(x)} + \frac{1}{2t} \|x - x^k \|_2^2. $$ Now if we combine the two terms on the right by completing the square, and then omit terms that do not depend on $x$, we find that \begin{align} x^{k+1} &= \arg \min_x \, h(x) + \frac{1}{2t} \|x - (x^k - t\nabla g(x^k)) \|_2^2 \\ &= \text{prox}_{th}(x^k - t\nabla g(x^k)). \end{align} This is the proximal gradient iteration. We see that the proximal operator of $h$ has appeared out of thin air. To me this seems like the most likely way that someone would discover the idea of a proximal operator.
Another motivation for the proximal operator comes from the viewpoint of monotone operator theory. Minimizing a closed convex function $h$ is equivalent to finding a point $x$ such that $$ \tag{1} 0 \in \partial h(x). $$ Here $\partial h(x)$ is the subdifferential of $h$ at $x$. The set-valued mapping $x \mapsto \partial h(x)$ is the prototypical example of a "monotone operator", and problem (1) is called a "monotone inclusion problem". Multiplying both sides by $t$ and then adding $x$ to both sides, we find that \begin{align} & 0 \in \partial h(x) \\ \iff & x \in (I + t \partial h)(x) \\ \iff & x \in (I + t \partial h)^{-1}(x). \end{align} The operator $(I + t \partial h)^{-1}$ is called the "resolvent" of the operator $\partial h$, and it has nice properties such as the following: if $x \in \mathbb R^n$, then $(I + t \partial h)^{-1}(x)$ is a singleton. Thus, $(I + t \partial h)^{-1}$ can be viewed as a function (rather than a set-valued mapping), and solving $0 \in \partial h(x)$ is equivalent to solving
$$ x = (I + t \partial h)^{-1}(x). $$ A natural idea is to solve for $x$ using fixed point iteration. The resulting iteration is $$ x^{k+1} = (I + t \partial h)^{-1}(x^k). $$
Punch line: The function $(I + t \partial h)^{-1}$ is none other than the proximal operator of $h$, and this iteration is known as the "proximal point algorithm".
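As a small sketch of the proximal point algorithm, take $h(x) = |x|$ (an illustrative one-dimensional choice), whose resolvent $(I + t\,\partial h)^{-1}$ is the scalar soft-thresholding map. Iterating the resolvent drives $x$ to the minimizer $x^\star = 0$:

```python
import numpy as np

def prox_abs(x, t):
    """(I + t * subdiff |.|)^{-1}(x) = sign(x) * max(|x| - t, 0)."""
    return np.sign(x) * max(abs(x) - t, 0.0)

t = 0.3
x = 2.0
for _ in range(10):
    x = prox_abs(x, t)  # fixed-point iteration x^{k+1} = prox_{t h}(x^k)
print(x)  # 0.0: the minimizer of |x| is reached in finitely many steps
```

Each step moves $x$ toward zero by $t$ until it lands exactly on the minimizer, a finite-termination behavior particular to this nonsmooth example.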
Solution 3:
There are many good answers here already, but for the sake of completeness I will add one more piece of intuition. The proximal operator of $g$ (with parameter $\lambda$) can also be seen as a gradient step (with stepsize $\lambda$) with respect to the Moreau envelope $g_\lambda$ of $g$. The Moreau envelope is a smooth under-approximation of the original function and is given by $$ g_\lambda(x) = \min_u \{g(u) + \frac{1}{2 \lambda} \| u-x\|^2\}. $$ Evidently its definition is closely related to the proximal operator. The above-mentioned identity then reads as $$ \operatorname{prox}_{\lambda g}(x) = x - \lambda \nabla g_\lambda(x). $$
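This identity can be checked numerically in a case where both sides are known in closed form. For $g(x) = |x|$, the Moreau envelope $g_\lambda$ is the Huber function, $g_\lambda(x) = x^2/(2\lambda)$ for $|x| \le \lambda$ and $|x| - \lambda/2$ otherwise, while $\operatorname{prox}_{\lambda g}$ is soft-thresholding. A sketch verifying $\operatorname{prox}_{\lambda g}(x) = x - \lambda \nabla g_\lambda(x)$ on a grid:

```python
import numpy as np

lam = 0.5

def prox_abs(x):
    """prox_{lam |.|}: soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def grad_moreau(x):
    """Gradient of the Moreau envelope of |.| (the Huber function):
    x/lam on |x| <= lam, sign(x) outside."""
    return np.where(np.abs(x) <= lam, x / lam, np.sign(x))

xs = np.linspace(-3, 3, 101)
assert np.allclose(prox_abs(xs), xs - lam * grad_moreau(xs))
print("identity prox = I - lam * grad(envelope) verified on a grid")
```

Note that $\nabla g_\lambda$ is $1/\lambda$-Lipschitz, so the prox is a gradient step on a genuinely smooth surrogate even though $g$ itself is nonsmooth.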