Difference between Gradient Descent method and Steepest Descent
What is the difference between Gradient Descent method and Steepest Descent methods?
In this book, they have come under different sections:
http://stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf
According to page 480, Gradient Descent is:
$$\Delta x=-\nabla f(x)$$
while page 490 says, for steepest descent:
$$\Delta x_{sd}=\|\nabla f(x)\|_*\,\Delta x_{nsd}$$ $$\Delta x_{nsd} = \operatorname{argmin}\{\nabla f(x)^Tv~|~\|v\|\leq 1\}$$
I cannot understand their difference. How are they mathematically and geometrically different?
I am reading this book too, and this confused me for a long time as well. The direction used in the gradient descent method is the negative gradient. In the steepest descent method, however, the direction is
$$\Delta x_{\text{nsd}}=\operatorname{argmin}\{\nabla f(x)^Tv~|~\|v\|\leq 1\},$$
which is the negative gradient only if the norm is Euclidean. If the norm is instead a quadratic norm or the $\ell_1$-norm, the resulting direction is generally not the negative gradient. For example, if the norm is the quadratic norm $\|z\|_P=(z^TPz)^{1/2}$, the steepest descent direction is $-P^{-1}\nabla f(x)$.
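A small numerical sketch of this (my own illustration, not from the book): for the quadratic norm $\|z\|_P=(z^TPz)^{1/2}$, the dual norm is $\|g\|_*=(g^TP^{-1}g)^{1/2}$, and the unnormalized step works out to $\Delta x_{sd}=-P^{-1}\nabla f(x)$, which reduces to the plain gradient step $-\nabla f(x)$ when $P=I$. The gradient vector `g` and matrix `P` below are arbitrary stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.normal(size=3)                 # stand-in for the gradient ∇f(x)
A = rng.normal(size=(3, 3))
P = A @ A.T + 3 * np.eye(3)            # a symmetric positive definite P

dual_norm = np.sqrt(g @ np.linalg.solve(P, g))    # ||∇f(x)||_* = (g^T P^{-1} g)^{1/2}
dx_nsd = -np.linalg.solve(P, g) / dual_norm       # argmin g^T v  s.t.  ||v||_P <= 1
dx_sd = dual_norm * dx_nsd                        # = -P^{-1} g

assert np.allclose(dx_sd, -np.linalg.solve(P, g))

# Sanity check: no random direction on the unit P-ball descends faster.
for _ in range(1000):
    v = rng.normal(size=3)
    v = v / np.sqrt(v @ P @ v)         # normalize so ||v||_P = 1
    assert g @ dx_nsd <= g @ v + 1e-9
```

The brute-force loop at the end just confirms that $\Delta x_{\text{nsd}}$ really does minimize the directional derivative over the unit ball of $\|\cdot\|_P$.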
In gradient descent, we compute the update for the parameter vector as $\boldsymbol \theta \leftarrow \boldsymbol \theta - \eta \nabla_{\!\boldsymbol \theta\,} f(\boldsymbol \theta)$. Steepest descent is typically defined as gradient descent in which the learning rate $\eta$ is chosen such that it yields maximal gain along the negative gradient direction. The part of the algorithm that is concerned with determining $\eta$ in each step is called line search.
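Under this reading, a minimal sketch would be gradient descent where $\eta$ is chosen by exact line search at every step. For a quadratic $f(\boldsymbol\theta)=\tfrac12\boldsymbol\theta^TQ\boldsymbol\theta-b^T\boldsymbol\theta$ the exact line search has the closed form $\eta = g^Tg/(g^TQg)$; the particular $Q$ and $b$ below are hypothetical.

```python
import numpy as np

Q = np.array([[3.0, 0.5], [0.5, 1.0]])   # hypothetical SPD Hessian
b = np.array([1.0, -2.0])
theta = np.zeros(2)

for _ in range(100):
    g = Q @ theta - b                    # ∇f(θ) for f(θ) = 0.5 θ^T Q θ - b^T θ
    if np.linalg.norm(g) < 1e-10:
        break
    eta = (g @ g) / (g @ Q @ g)          # exact line search along -g
    theta = theta - eta * g

assert np.allclose(theta, np.linalg.solve(Q, b))  # converges to the minimizer Q^{-1} b
```

With a fixed small $\eta$ instead of the line search, the same loop would be plain gradient descent; the closed-form $\eta$ is what makes each step "steepest" in this sense.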
I happen to be looking at the same part of Boyd's Convex Optimization book and thought I'd give my 2 cents on this matter:
Method of Gradient Descent: only cares about descent in the negative gradient direction.
Method of Steepest Gradient Descent: descends in the direction of the largest directional derivative.
I believe the critical difference here is the directional derivative ($\nabla f(x)^{T}v$, the directional derivative of $f$ at $x$ in the direction $v$). The method of steepest descent can be viewed as (from page 476 of Boyd's Convex Optimization book):
i.e., as the direction in the unit ball of $\| \cdot \|$ that extends farthest in the direction $−\nabla f(x)$.
The norm $\| \cdot \|$ constrains the directions you can move in, and this constraint on the directional derivative changes the behavior of the descent. Like others have said, if you choose $\| \cdot \|_{2}$, the two methods are identical. There are other cases where one would favor an alternative norm for a specific problem. For example, if you want to perform descent on a sparse dataset and you want to penalize the $\|\cdot \|_{1}$ norm (as a convex relaxation of the pseudo-norm $\|\cdot \|_{0}$), then you would probably want to use steepest descent with the $\ell_{1}$ norm.
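To make the $\ell_1$ case concrete (my own sketch, not from the book): the dual of $\|\cdot\|_1$ is $\|\cdot\|_\infty$, so the steepest descent step updates only the coordinate with the largest partial derivative in magnitude, i.e. a greedy coordinate-descent step.

```python
import numpy as np

def l1_steepest_direction(g):
    """Δx_sd for the l1 norm: dual norm is ||g||_inf, achieved at one coordinate."""
    i = np.argmax(np.abs(g))
    dx = np.zeros_like(g)
    dx[i] = -g[i]            # = -||g||_inf * sign(g_i) on coordinate i only
    return dx

g = np.array([0.3, -2.0, 1.1])       # stand-in gradient
assert np.allclose(l1_steepest_direction(g), [0.0, 2.0, 0.0])
```

Note how the step is one-sparse regardless of the gradient, in contrast to the dense step $-\nabla f(x)$ that the Euclidean norm gives.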