Optimal step size in gradient descent

Suppose a differentiable, convex function $F(x)$ exists. Then $b = a - \gamma\nabla F(a)$ implies that $F(b) \leq F(a)$ given $\gamma$ is chosen properly. The goal is to find the optimal $\gamma$ at each step. In my book, in order to do this, one should minimize $G(\gamma)=$ $F(x-\gamma\nabla F(x))$ for $\gamma$. It also says that it should be minimized via a line search. Why can't this function be minimized by simple calculus?

I am unsure what it means to perform a line search on this function. I am also unsure why we want to minimize this function to find the optimal step size.

Solution 1:

You are already using calculus when you are performing gradient search in the first place. At some point, you have to stop calculating derivatives and start descending! :-)

In all seriousness, though: what you are describing is exact line search. That is, you actually want to find the minimizing value of $\gamma$, $$\gamma_{\text{best}} = \mathop{\textrm{arg min}}_\gamma F(a+\gamma v), \quad v = -\nabla F(a).$$ It is a very rare, and probably manufactured, case that allows you to efficiently compute $\gamma_{\text{best}}$ analytically. It is far more likely that you will have to perform some sort of gradient or Newton descent on $\gamma$ itself to find $\gamma_{\text{best}}$.

The problem is, if you do the math on this, you will end up having to compute the gradient $\nabla F$ at every iteration of this line search. After all: $$\frac{d}{d\gamma} F(a+\gamma v) = \langle \nabla F(a+\gamma v), v \rangle$$ Look carefully: the gradient $\nabla F$ has to be evaluated at each value of $\gamma$ you try.

That's an inefficient use of what is likely to be the most expensive computation in your algorithm! If you're computing the gradient anyway, the best thing to do is use it to move in the direction it tells you to move---not stay stuck along a line.

What you want in practice is a cheap way to compute an acceptable $\gamma$. The common way to do this is a backtracking line search. With this strategy, you start with an initial step size $\gamma$---usually a small increase on the last step size you settled on. Then you check to see if that point $a+\gamma v$ is of good quality. A common test is the Armijo-Goldstein condition $$F(a+\gamma v) \leq F(a) - c \gamma \|\nabla F(a)\|_2^2$$ for some $c<1$. If the step passes this test, go ahead and take it---don't waste any time trying to tweak your step size further. If the step is too large---for instance, if $F(a+\gamma v)>F(a)$---then this test will fail, and you should cut your step size down (say, in half) and try again.

This is generally a lot cheaper than doing an exact line search.

I have encountered a couple of specific cases where an exact line search could be computed more cheaply than what is described above. This involved constructing a simplified formula for $F(a+\gamma v)$ , allowing the derivatives $\tfrac{d}{d\gamma}F(a+\gamma v)$ to be computed more cheaply than the full gradient $\nabla F$. One specific instance is when computing the analytic center of a linear matrix inequality. But even in that case, it was generally better overall to just do backtracking.

Solution 2:

There is a good discussion of this in chapter 10 of Numerical Recipes. Old versions are free online.

You are right that if you have $F$ in a simple enough form, you can minimize over $\gamma$ by calculus. You might even be able to find the minimum directly, without iteration. Often you don't have it in that form. $x$ and $\bigtriangledown F(x)$ are both vectors and it may not be feasible to compute $\bigtriangledown F(x)$ analytically but you can search for a minimum.

What it means to perform a line search is hidden in the symbolism. The value of $G(\gamma)$ is precisely the value of $F$ along a line from the current point $x$ in the direction $\bigtriangledown F(x)$. It should remind you of a parameterized line in three dimensions: a point plus a variable times a direction vector. The reason you do this is because this is the best point along that line. You expect that the value will decrease along that direction (because you have chosen the gradient, which is the direction of greatest decrease) but if you go too far it will start to increase again. Why not stop at the bottom of the valley and try again?

Some of the methods in Numerical Recipes don't need any computations of the gradient at all. They come up with directions to minimize over in other ways.

Solution 3:

Not your question, but an adaptive step size can beat a constant $\gamma$, and a $\gamma_i$ per component can beat a single $\gamma$ for all components. See RmsProp , rmsprop.py
and Zeiler, ADADELTA: An adaptive learning rate method, 2012, 6p.
But be careful with recursive / IIR filters !

Optimal step size in gradient descent

Solution 1:

Solution 2:

Solution 3:

Related

Recent Posts