Difference between Gradient Descent method and Steepest Descent

What is the difference between Gradient Descent method and Steepest Descent methods?

In this book, they have come under different sections:

http://stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf

According to page 480, Gradient Descent is:

$$\Delta x=-\nabla f(x)$$

While page 490, for Steepest descent says:

$$\Delta x_{sd}=||\nabla f(x)||_*\Delta x_{nsd}$$ $$\Delta x_{nsd} = \text{argmin}\{\nabla f(x)^Tv~|~~~ ||v||\leq 1\}$$

I cannot understand their difference. How they are mathematically and geometrically different?

I am reading this book too, this is also a problem for me for a long time. The direction of gradient descent method is negative gradient. However the direction of steepest descent method is the direction such that

$Δx_{\text{nsd}}=\text{argmin}\{∇f(x)^Tv \quad| \quad ||v||≤1\}$

which is negative gradient only if the norm is euclidean. If the norm is other quadratic or l1norm, the result are not negative gradient.

The direction is -inv(P)*∇f(x), if the norm is quadratic norm.

In gradient descent, we compute the update for the parameter vector as $\boldsymbol \theta \leftarrow \boldsymbol \theta - \eta \nabla_{\!\boldsymbol \theta\,} f(\boldsymbol \theta)$. Steepest descent is typically defined as gradient descent in which the learning rate $\eta$ is chosen such that it yields maximal gain along the negative gradient direction. The part of the algorithm that is concerned with determining $\eta$ in each step is called line search.

I happen to also be looking at the same part of the Boyd's Convex Optimization book and thought to give my 2 cents on this matter:

Method of Gradient Descent: only cares about descent in the negative gradient direction.
Method of Steepest Gradient Descent: descents in the direction of the largest directional derivative.

I believe the critical difference here is the directional derivative ($\nabla f(x)^{T}v$ = gradient of $f$ at $x$ in direction $v$ ). The method of Steepest Descent can be viewed as (from Page 476 of Boyd's Convex Optimization book):

i.e., as the direction in the unit ball of $\| \cdot \|$ that extends farthest in the direction $−\nabla f(x)$.

Where the norm $\| \cdot \|$ constrains the direction that you could move to. This limit on the directional gradient changes the behavior of the descent. Like others have said, if you choose $\| \cdot \|_{2}$, the two methods are identical. There are other cases where one would favor an alternative norm for specific problems. For example, if you want to perform descent on a sparse dataset and you want to penalize $\|\cdot \|_{1}$ norm (as a convex relaxation of pseudo-norm $\|\cdot \|_{0}$), then you would probably want to use Steepest Gradient Descent with a $L_{1}$ norm.

Example of a countable compact set in the real numbers [closed]

If something is neither true nor false, what is it?

Were there any proofs of whether or not a statement could be proved true or false before Gödel's Incompleteness Theorems?

Product of measurable functions is measurable

Prove by Induction that every term of the following sequence is irrational

Can ring homomorphisms be characterized as ring maps such that preimage of any ideal is an ideal?

Why is it that the congruence relations usually correspond to some type of subobject?

Is there a direct proof for there is no rational number whose square is 2?

How to prove $\left \lceil \frac{n}{m} \right \rceil = \left \lfloor \frac{n+m-1}{m} \right \rfloor$?

How can there be multiple irreducible representations of a group each having distinct dimension?

RSA: is it easy to find the public key from the secret key?

Why is the condition enough for a matrix to be diagonalizable?