In order to find the local minima of a scalar function $p(x), x\in \mathbb{R}^3$, I know we can use the gradient descent method: $$x_{k+1}=x_k-\alpha_k \nabla_xp(x)$$ where $\alpha_k$ is the step size and $\nabla_xp(x)$ is the gradient of $p(x)$.

My question is: what if $x$ must be constrained on a sphere, i.e., $\|x_k\|=1$? Then we are actually to find the local minima of $p(x)$ on a sphere. Can gradient descent be applied to constrained optimizations? Can anyone give any suggestions? Thanks.


There's no need for penalty methods in this case. Compute the gradient, $g_k=\nabla_xp(x)$, project it onto the tangent plane, $h_k=g_k-(g_k\cdot x_k)x_k$, and normalize it, $n_k=h_k/|h_k|$. Now you can use $x_{k+1}=x_k\cos\phi_k + n_k\sin\phi_k $ and perform a one-dimensional search for $\phi_k$, just like in an unconstrained gradient search, and it stays on the sphere and locally follows the direction of maximal change in the standard metric on the sphere.

By the way, this can be generalized to the case where you're optimizing a set of $n$ vectors under the constraint that they're orthonormal. Then you compute all the gradients, project the resulting search vector onto the tangent surface by orthogonalizing all the gradients to all the vectors, and then diagonalize the matrix of scalar products between pairs of the gradients to find a coordinate system in which the gradients pair up with the vectors to form $n$ hyperplanes in which you can rotate while exactly satisfying the constraints and still travelling in the direction of maximal change to first order; the angles by which you rotate in the hyperplanes are multiples of a single parameter (with the multipliers determined by the eigenvalues), so this is again reduced to a one-dimensional search.

[Edit:] (In response to the suggestion in the comment to use $x_{k+1}=\frac{x_k-\alpha_kg_k}{\|x_k-\alpha_kg_k\|}$ instead)

This is slightly disadvantageous compared to the great-circle solution. You're effectively searching on the same great circle, except in this approach you can only generate one half of it. If the optimum lies in the other half, then the optimal value of your parameter $\alpha_k$ will be $\pm\infty$, whereas in my formulation the compact search space $\phi_k\in[0,2\pi]$ maps to the entire great circle. Also, this formulation doesn't generalize to the case of $n$ orthonormal vectors, since you'd have to perform an orthonormalization at each step.

How to determine $\alpha_k$ is not a (new) problem; in every gradient search you need to have some way to determine the optimum in a one-dimensional search.


The sphere is a particular example of a (very nice) Riemannian manifold.

Most classical nonlinear optimization methods designed for unconstrained optimization of smooth functions (such as gradient descent which you mentioned, nonlinear conjugate gradients, BFGS, Newton, trust-regions, etc.) work just as well when the search space is a Riemannian manifold (a smooth manifold with a metric) rather than (classically) a Euclidean space.

The branch of optimization concerned with that topic is called Riemannian optimization or optimization on manifolds. There is a great reference book on the topic that you can access online for free:

Optimization algorithms on matrix manifolds, P.-A. Absil, R. Mahony, R. Sepulchre, Princeton university press, 2008.

https://press.princeton.edu/absil

Some of the theory presented in that book (and some more) is implemented in a Matlab toolbox called Manopt (which I develop with a number of collaborators), also available freely online:

http://www.manopt.org

The tutorial happens to start with an example on the sphere, which could be a handy starting point for the question raised by the OP (assuming Matlab is an option).

EDIT:

I wrote an introduction to optimization on manifolds, available here:

http://www.nicolasboumal.net/book


What we usually do in theory (not in practice) is that to ensure convergence of such algorithms, for a constraint $g : \mathbb R^n \to \mathbb R$ and $f : \mathbb R^n \to \mathbb R$ that we wish to minimize under the condition $g(x) = 0$, we define $$ \overline f : \mathbb R^n \to \overline{\mathbb R} \cup \{ + \infty\}, \qquad \overline f(x) = \begin{cases} + \infty & \text{ if } g(x) = 0 \\ f(x) & \text{ if } g(x) \neq 0. \\ \end{cases} $$ and we minimize $\overline f$ under no constraints, and clearly $f$ has the same minimum as $\overline f$. Now what you want to do in your numerical methods is defining a similar $\overline f$ by giving a very big upper bound on the values of $f$, to make sure your algorithm doesn't go in those directions. You might lose differentiability with this method, which can be quite annoying since you are using differentiation in your steepest descent method.

What you might want to try is using something like a mollifier : the idea is very simple. Instead of making a huge gap between the surface of the sphere and the outside of the surface (I mean outside by "not on the surface, which can mean inside or outside the ball... anyway), you make the gap smooth by using a bump function (see http://en.wikipedia.org/wiki/Bump_function for images) which is $0$ on the sphere and has a very high value when you're at a distance $\delta$ from the sphere. If you call this bump function $B(x)$ that has the property that $B(x) = 0$ whenever $\| x \| = 1$, then define $$ \overline f(x) = f(x) + B(x). $$ Since $B = 0$ on the sphere, $\overline f$ and $f$ will have the same minimum with respect to your constraint. What you're hoping is that the minimum you will find will not be outside your constraint. If that happens, make $\delta$ smaller. This technique might not work if your function behaves badly close to the frontier of the sphere (i.e. goes to very low values when close to the sphere in a weird manner even though there is a minimum on the sphere). If it has nice behavior though I don't expect any problem.

One example of such a bump function would be to define $$ B(x) = \begin{cases} \sin^2 \left( \frac{\| x \| - 1}{\delta} \right) & \text{ if } \left| \| x \| - 1 \right| < \delta \frac {\pi}2 \\ 1 & \text{ if not. } \\ \end{cases} $$ Make $\delta$ small enough and multiply the bump function by any big factor you need to make it practical. I don't know if it's the best idea, but its one I have. Note that my "bump function" is not a bump function in the mathematical sense (it's not compactly supported), but it's just a practical name in this post.

EDIT : Better choices of $B$ are to be expected, as one of my comments mention $B(x) = c(\|x\|-1)^2$, with $c$ large enough. I guess J.M.'s suggestion (reading non-linear programming references) might give further indications.

Hope that helps,


Thanks joriki and Patrick. Frankly I prefer joriki's answer because it considers the specific constraint. And Patrick's answer can be applied to more generic problems. For the record, I happen to have another solution, which I think is very similar to joriki's. I appreciate any comments on the method.

The idea is to use a rotation to satisfy the constraint $\|x_{k+1}\|=\|x_k\|$: $$x_{k+1}=R_kx_k$$ where $R_k$ is a rotation matrix ($R_k^TR_k=I, \det R_k=1$). Note rotation just change the direction of a vector, but no change the length. In order to determine $R_k$, I use the axis-angle representation of a rotation matrix (aka Rodrigues’ formula): $$R_k=I+[\omega_k]_{\times}\sin\theta_k+[\omega_k]_{\times}^2(1-\cos\theta_k)$$ where $\omega_k$ is the rotation axis, $\theta_k$ is the rotation angle and $[\omega]_{\times}$ is the skew-symmetric matrix associating with $\omega$. The rotation axis can be chosen as $$\omega_k=x_k \times n_k$$ and the step size $\theta_k$ might be determined by a one dimensional search.


EDIT: $$R_kx_k=x_k+\sin\theta_k (\omega_k\times x_k)+(1-\cos\theta_k)(\omega_k\omega_k^T-I)x_k$$ $$=x_k+\sin\theta_k (\omega_k\times x_k)-(1-\cos\theta_k)x_k =\sin\theta_k (\omega_k\times x_k)+\cos\theta_k x_k$$ In the above equation, we use the facts: $[\omega]{\times}^2=\omega\omega^T-\|\omega\|I$ and $\omega^Tx=0$ (note $\omega\times x\ne 0$). Choose $\omega_k\times x_k=n_k$, i.e., $\omega_k=x_k\times n_k$, the above equation is exactly the same as joriki's answer.


enter image description here

EDIT again: prove it is gradient descent. For $x_{k+1}=x_k+v$, as long as $v^T(-g)>0$, it is gradient descent. And $p(x)$ is guaranteed to decrease given sufficiently small step. For our problem, can we use this to prove? In our equation, $$x_{k+1}=x_k+\underset{v_k}{\underbrace{\sin\theta_k (\omega_k\times x_k)-(1-\cos\theta_k)x_k}}$$ so we need to check if $-v^Tg>0$. It is very interesting that unlike common cases, the evolving direction is depend on the step size $\theta$. We should know that gradient descent is only valid (mathematically) for sufficiently small step size. When $\theta$ is very small, $\sin\theta=\theta$ and $1-\cos\theta\approx0$ (second order). So we have $$v_k\approx\theta_k n_k$$ Obviously $-g^T\theta_k n_k>0$. So it is gradient descent.

Summary: in order to satisfy the constraint, we have $v_k=\sin\theta_k (\omega_k\times x_k)-(1-\cos\theta_k)x_k$ (complex). But in order to prove the descent property, we have $v_k\approx\theta_k n_k$ (simple and essential).