Newton's method vs. gradient descent with exact line search
tl;dr: When is gradient descent with exact line search preferred over Newton's method?
I simply don't understand why exact line search is ever useful, and here's my reasoning.
Let's say I have a function $f$ that I want to minimize. Gradient descent performs the updates:
$$\vec{x} \gets \vec{x} - t \vec{\nabla}f(\vec{x})$$
The optimal step size $t$ along a search direction $\vec{h}$ must satisfy:
$$\frac{d}{d t} f(\vec{x} - t\vec{h}) = 0$$
To solve this, I approximate $f$ with its Taylor polynomial:
$$f(\vec{x} - t\vec{h}) \approx f(\vec{x}) - t\vec{h}^\top \vec{\nabla}f(\vec{x}) + \frac{t^2}{2}\vec{h}^\top \vec{\nabla}^2f(\vec{x})\ \vec{h} + \ldots$$
...and then differentiate it with respect to $t$ and set it equal to $0$:
$$\frac{d}{d t} f(\vec{x} - t\vec{h}) \approx -\vec{h}^\top \vec{\nabla}f(\vec{x}) + t\vec{h}^\top \vec{\nabla}^2f(\vec{x})\ \vec{h} = 0$$
$$t = \frac{\vec{h}^\top \vec{\nabla}f(\vec{x})}{\vec{h}^\top \vec{\nabla}^2f(\vec{x})\ \vec{h}}$$
Since $\vec{h} = \vec{\nabla} f(\vec{x})$, we have:
$$t = \frac{\vec{\nabla}f(\vec{x})^\top \vec{\nabla}f(\vec{x})}{\vec{\nabla}f(\vec{x})^\top \vec{\nabla}^2f(\vec{x})\ \vec{\nabla}f(\vec{x})}$$
This is great, except that we can see that computing $t$ requires the Hessian $\vec{\nabla}^2f$ (expensive)!
But if I could really compute the Hessian, then wouldn't I be using Newton's method instead?
Why would I ever choose to do exact line search?
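(For concreteness, here is a quick numerical check of that step-size formula on a toy quadratic; the example is a throwaway of my own and nothing about it is special.)

```python
import numpy as np

# Toy quadratic f(x) = 0.5 x^T A x - b^T x, so grad f(x) = A x - b and the Hessian is A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

def grad(x):
    return A @ x - b

def hess(x):
    return A

x = np.array([2.0, 2.0])
g = grad(x)
t = (g @ g) / (g @ hess(x) @ g)    # the step-size formula derived above
print(t, grad(x - t * g) @ g)      # directional derivative along g is ~0 at the new point
```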
Since I seem to be the only one who thinks this is a duplicate, I will accept the wisdom of the masses :-) and attempt to turn my comments into an answer.
Here's the TL;DR version:
- what you have described is not an exact line search.
- a proper exact line search does not need to use the Hessian (though it can).
- a backtracking line search is generally preferred in practice, because it makes more efficient use of the gradient and (when applicable) Hessian computations, which are often expensive. (EDIT: coordinate descent methods often use exact line search.)
- when properly constructed, the line search should have no impact on your choice between gradient descent and Newton's method.
An exact line search is one that solves the following scalar minimization exactly---or, at least, to a high precision: $$t = \mathop{\textrm{argmin}}_{\bar{t}} f( x - \bar{t} h ) \tag{1}$$ where $f$ is the function of interest, $x$ is the current point, and $h$ is the current search direction. For gradient descent, $h=\nabla f(x)$; for Newton descent, $h=(\nabla^2 f(x))^{-1}\nabla f(x)$. The key concept, of course, is that you have temporarily reduced a multidimensional optimization problem down to a single dimension. But to call it an exact line search, you still have to determine a near-exact solution to $(1)$.
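To make the reduction to one dimension concrete, here is a minimal sketch of solving $(1)$; the use of `scipy.optimize.minimize_scalar` and the bracket $[0, 10]$ are just illustrative choices of mine, and any sufficiently accurate scalar minimizer would do.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def exact_line_search(f, x, h, t_max=10.0):
    """Solve t = argmin_t f(x - t*h) to high accuracy on the bracket [0, t_max]."""
    phi = lambda t: f(x - t * h)       # the one-dimensional restriction of f
    return minimize_scalar(phi, bounds=(0.0, t_max), method="bounded").x

# Toy quadratic; for gradient descent the search direction is h = grad f(x).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b

x = np.array([2.0, 2.0])
h = grad(x)
print(exact_line_search(f, x, h))      # ~0.2857 for this example
```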
You haven't done this. Instead, you've taken a single Newton step in $t$. This is an approximate line search, not an exact one. To turn your approach into an exact line search you need to iterate: $$t_0=0, \quad t_{k+1} = t_k - \alpha_k\left(\left.\frac{d^2}{dt^2} f(x-t h) \right|_{t=t_k}\right)^{-1}\left.\frac{d}{dt} f(x-t h) \right|_{t=t_k}, \quad k=0,1,2,\dots$$ where $\alpha_k$ is a step size that you must select to ensure convergence. (We will talk about a good way to select $\alpha_k$ later.)
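In code, that iteration would look something like the sketch below (a fixed $\alpha_k$ purely for illustration; on the toy quadratic the first step already lands on the exact minimizer).

```python
import numpy as np

def newton_line_search(grad, hess, x, h, alpha=1.0, n_iter=10):
    """Damped Newton iteration on phi(t) = f(x - t*h).

    phi'(t)  = -<grad f(x - t*h), h>
    phi''(t) =  h^T hess f(x - t*h) h
    """
    t = 0.0
    for _ in range(n_iter):
        g = grad(x - t * h)
        H = hess(x - t * h)
        t = t - alpha * (-(g @ h)) / (h @ H @ h)
    return t

# Toy quadratic: the first iterate already equals g^T g / (g^T A g).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda x: A @ x - b
hess = lambda x: A

x = np.array([2.0, 2.0])
h = grad(x)
print(newton_line_search(grad, hess, x, h))   # ~0.2857
```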
As you have seen, computing even one of these iterates requires the computation of the Hessian $\nabla^2 f(x-t_k h)$. And you are right to say that this makes no sense! If you are going to use the Hessian in the line search, you should use it in the outer descent steps. Granted, you don't have to invert the Hessian in the scalar case, so it might be slightly less expensive, but still, the point stands.
Your second mistake, then, is assuming that an exact line search must use the Hessian. This is not the case: any numerical method for solving $(1)$ will do; and for gradient descent, it makes sense to use the gradient alone for an exact line search. That is: $$t_0=0, \quad t_{k+1} = t_k - \alpha_k \left.\frac{d}{dt} f(x-t h) \right|_{t=t_k}, \quad k=0,1,\dots$$ Again, $\alpha_k$ is a step size chosen to ensure convergence.
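A sketch of that gradient-only iteration follows; the fixed $\alpha$ below happens to be small enough for this toy problem, but in general $\alpha_k$ has to be chosen (or adapted) to guarantee convergence.

```python
import numpy as np

def gradient_line_search(grad, x, h, alpha, n_iter=100):
    """Minimize phi(t) = f(x - t*h) by scalar gradient descent on t.

    Only gradient evaluations of f are needed: phi'(t) = -<grad f(x - t*h), h>.
    """
    t = 0.0
    for _ in range(n_iter):
        t = t - alpha * (-(grad(x - t * h) @ h))
    return t

# Same toy quadratic as before; each inner iteration costs one full gradient.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda x: A @ x - b

x = np.array([2.0, 2.0])
h = grad(x)
print(gradient_line_search(grad, x, h, alpha=2e-3))   # ~0.2857
```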
Even with this reduced complexity, an exact line search rarely makes sense. This is the point I made in the other question. After all, $$\left.\frac{d}{dt} f(x-t h) \right|_{t=t_k} = \langle \nabla f(x-th), h \rangle = \langle \nabla f\left(x-t\nabla f(x)\right), \nabla f(x)\rangle$$ Each step of the line search requires computing the full gradient at the new candidate point. If you're going to go through the trouble of computing the gradient, why should you limit yourself to moving in the direction of the old gradient?
EDIT: If evaluating $f$ alone is significantly less expensive than evaluating $\nabla f$, then one option is to use bisection to perform the exact line search. I will leave this as an exercise for the reader.
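For what it's worth, here is one derivative-free way to do this kind of search. I've used golden-section bracketing rather than literal bisection, but the spirit is the same: only values of $f$ are ever evaluated, and $\phi(t)=f(x-th)$ is assumed unimodal on the bracket.

```python
import numpy as np

def golden_section_line_search(f, x, h, t_lo=0.0, t_hi=10.0, tol=1e-8):
    """Derivative-free line search on phi(t) = f(x - t*h).

    Golden-section search: repeatedly shrinks a bracket [a, b] using only
    function values. (Both interior points are re-evaluated each pass for
    simplicity; a careful implementation would reuse one of them.)
    """
    phi = lambda t: f(x - t * h)
    inv_phi = (np.sqrt(5.0) - 1.0) / 2.0      # 1 / golden ratio ~ 0.618
    a, b = t_lo, t_hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if phi(c) < phi(d):
            b = d
        else:
            a = c
    return 0.5 * (a + b)

# Same toy quadratic; only f is evaluated inside the line search
# (the direction h still needs one gradient evaluation).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b_vec = np.array([1.0, -1.0])
f = lambda x: 0.5 * x @ A @ x - b_vec @ x
grad = lambda x: A @ x - b_vec

x = np.array([2.0, 2.0])
h = grad(x)
print(golden_section_line_search(f, x, h))    # ~0.2857
```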
In practice, one rarely does exact line search. Instead, one does something called backtracking line search. In effect, a backtracking line search does only the first step of a numerical line search, and takes care to make sure that the step size $\alpha_1$ guarantees a sufficient reduction in the objective function. Since we're only taking one step, I'll drop the subscript and just call the step size $\alpha$. There are many ways to go about this, but here's what I do.
1. Select constants $c_1>1$, $0<c_2,c_3<1$, and an initial step size; say $\alpha=1$.
2. Select an initial guess of $\alpha\leftarrow c_1\alpha$. The idea is that for each step of your outer iteration (gradient or Newton), you're going to try a slightly larger step size so that you can converge faster.
3. Check to see if $f(x-\alpha h) \leq f(x) - c_2\alpha\langle \nabla f(x),h \rangle$. This is called the Armijo-Goldstein condition.
4. If that condition is satisfied, STOP. $\alpha$ is your step size, and $x-\alpha h$ is your new iterate.
5. If that condition is not satisfied, reduce the step size with $\alpha\leftarrow c_3\alpha$, and repeat steps 3-5.
If you're doing a Newton iteration, you might change step 2 so that the maximum value of $\alpha$ is 1. That's because once you've reached the region of quadratic convergence, there really isn't going to be any benefit to having $\alpha>1$.
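Here is a minimal Python sketch of steps 1-5; the particular constants and the toy problem are arbitrary choices of mine, and the `newton` flag implements the cap on $\alpha$ just mentioned.

```python
import numpy as np

def backtracking_line_search(f, grad_fx, x, h, alpha_prev,
                             c1=1.1, c2=0.1, c3=0.5, newton=False, max_iter=50):
    """Backtracking line search following steps 1-5 above.

    grad_fx    : gradient of f at x (already available from the outer iteration)
    h          : descent direction; the new iterate will be x - alpha*h
    alpha_prev : step size accepted at the previous outer iteration
    c1 > 1 and 0 < c2, c3 < 1 are the constants from step 1.
    """
    fx = f(x)
    slope = grad_fx @ h                  # <grad f(x), h>
    alpha = c1 * alpha_prev              # step 2: try a slightly larger step
    if newton:
        alpha = min(alpha, 1.0)          # cap alpha at 1 for Newton iterations
    for _ in range(max_iter):
        if f(x - alpha * h) <= fx - c2 * alpha * slope:   # step 3: Armijo-Goldstein
            return alpha                 # step 4: accept
        alpha *= c3                      # step 5: shrink and try again
    return alpha

# Toy usage: gradient descent on the same quadratic, with h = grad f(x).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b

x, alpha = np.array([2.0, 2.0]), 1.0
for _ in range(50):
    g = grad(x)
    alpha = backtracking_line_search(f, g, x, g, alpha)
    x = x - alpha * g
print(x, np.linalg.solve(A, b))          # x should be close to the true minimizer
```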
EDIT: I just attended a lecture yesterday where it was stated that exact line search is often employed in coordinate descent methods. This makes sense when a one-dimensional minimization along a single coordinate is significantly cheaper than computing a full gradient.
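As a small illustration of why that can pay off: for a quadratic, the exact minimization over a single coordinate has a closed-form solution (it is just the Gauss-Seidel update), so each "line search" costs almost nothing. The example below is my own toy sketch, not something from that lecture.

```python
import numpy as np

# Coordinate descent on f(x) = 0.5 x^T A x - b^T x with exact line search:
# minimizing over x_i alone has a closed-form solution, so each 1-D
# "line search" is essentially free compared with a full gradient step.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

x = np.zeros(2)
for sweep in range(25):
    for i in range(len(x)):
        # exact argmin over x_i with all other coordinates held fixed
        x[i] = (b[i] - A[i] @ x + A[i, i] * x[i]) / A[i, i]
print(x, np.linalg.solve(A, b))    # the two should agree
```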