Gradient And Hessian Of General 2-Norm

Given $f(\mathbf{x}) = \|\mathbf{Ax}\|_2 = (\mathbf{x}^\mathrm{T} \mathbf{A}^\mathrm{T} \mathbf{Ax} )^{1/2}$,

$\nabla f(\mathbf{x}) = \frac {\mathbf{A}^\mathrm{T} \mathbf{Ax}} {\|\mathbf{Ax}\|_2} = \frac {\mathbf{A}^\mathrm{T} \mathbf{Ax}} {(\mathbf{x}^\mathrm{T} \mathbf{A}^\mathrm{T} \mathbf{Ax} )^{1/2}}$

$\nabla^2 f(\mathbf{x}) = \frac { (\mathbf{x}^\mathrm{T} \mathbf{A}^\mathrm{T} \mathbf{Ax} )^{1/2} \cdot \mathbf{A}^\mathrm{T} \mathbf{A} - (\mathbf{A}^\mathrm{T} \mathbf{Ax})^\mathrm{T} (\mathbf{x}^\mathrm{T} \mathbf{A}^\mathrm{T} \mathbf{Ax} )^{-1/2} \mathbf{A}^\mathrm{T} \mathbf{Ax} } {(\mathbf{x}^\mathrm{T} \mathbf{A}^\mathrm{T} \mathbf{Ax} ) } = \frac { \mathbf{A}^\mathrm{T} \mathbf{A} } { (\mathbf{x}^\mathrm{T} \mathbf{A}^\mathrm{T} \mathbf{Ax} )^{1/2}} - \frac {\mathbf{x}^\mathrm{T} \mathbf{A}^\mathrm{T} \mathbf{A} \mathbf{A}^\mathrm{T} \mathbf{Ax} } { (\mathbf{x}^\mathrm{T} \mathbf{A}^\mathrm{T} \mathbf{Ax} )^{3/2} }$

I am looking for confirmation that I have done the above correctly. The dimensions match up, except that the second term of the Hessian is a scalar, which makes me think that something is missing.

Edit: Also, the last equality reduces to

$\nabla^2 f(\mathbf{x}) = \frac {\mathbf{A}^\mathrm{T} \mathbf{A} - \nabla f(\mathbf{x})^\mathrm{T} \nabla f(\mathbf{x})} {\|\mathbf{Ax}\|_2}$


Solution 1:

It is easier to work with $\phi(x) = \frac{1}{2} f^2(x)$. Just expand $\phi$ around $x$.

$\phi(x+\delta) = \frac{1}{2} (x + \delta)^T A^T A (x + \delta) = \phi(x) + x^TA^TA \delta + \frac{1}{2} \delta^T A^T A \delta$. It follows from this that the gradient $\nabla \phi(x) = A^T A x$, and the Hessian is $H = A^TA$.

To finish, let $g(t) = \sqrt{2t}$ for scalar $t > 0$, and note that $f = g \circ \phi$. To get the first derivative, use the chain rule $D f(x) = Dg(\phi(x)) D \phi(x)$ with $Dg(t) = \frac{1}{\sqrt{2t}}$, which gives $Df(x) = \frac{1}{\sqrt{2 \phi(x)}} x^T A^T A = \frac{1}{\|Ax\|} x^T A^T A$ (valid wherever $Ax \neq 0$).

Let $\eta(x) = \frac{1}{\|Ax\|}$, and $\gamma(x) = x^T A^T A$, and note that $D f(x) = \eta(x) \cdot \gamma(x)$, so we can use the product rule. Let $h(x) = Df(x)$ then the product rule gives $D h(x) (\delta) = (D \eta(x) (\delta)) \gamma(x) + \eta(x) D \gamma(x) (\delta)$.

Expanding this yields: $Dh(x)(\delta) = (- \frac{1}{\|Ax\|^2} \frac{1}{\|Ax\|} x^T A^T A \delta) x^T A^T A + \frac{1}{\|Ax\|} \delta^T A^T A $. Noting that $x^T A^T A \delta = \delta^T A^T A x$, we can write this as: $$Dh(x)(\delta) = \delta^T(\frac{1}{\|Ax\|} A^T A - \frac{1}{\|Ax\|^3 } A^T A x x^T A^T A),$$ or alternatively: $$D^2 f(x) = \frac{1}{\|Ax\|} A^T A - \frac{1}{\|Ax\|^3 } A^T A x x^T A^T A .$$
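As a sanity check, the gradient and Hessian formulas above can be compared against central finite differences. The following is only a minimal sketch in plain Python: the matrix `A`, the point `x`, the step sizes, and all helper names (`matvec`, `fd_grad`, and so on) are illustrative assumptions, not part of the derivation.

```python
import math

def matvec(M, v):
    return [sum(m * t for m, t in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(M, N):
    Nt = transpose(N)
    return [[sum(a * b for a, b in zip(row, col)) for col in Nt] for row in M]

def f(A, x):
    # f(x) = ||Ax||_2
    return math.sqrt(sum(y * y for y in matvec(A, x)))

def grad(A, x):
    # Analytic gradient: A^T A x / ||Ax||
    v = matvec(transpose(A), matvec(A, x))
    n = f(A, x)
    return [vi / n for vi in v]

def hessian(A, x):
    # Analytic Hessian: A^T A / ||Ax|| - (A^T A x)(A^T A x)^T / ||Ax||^3
    AtA = matmul(transpose(A), A)
    v = matvec(AtA, x)
    n = f(A, x)
    d = len(x)
    return [[AtA[i][j] / n - v[i] * v[j] / n**3 for j in range(d)]
            for i in range(d)]

def fd_grad(A, x, h=1e-6):
    # Central differences of f, one coordinate at a time
    out = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        out.append((f(A, xp) - f(A, xm)) / (2 * h))
    return out

def fd_hessian(A, x, h=1e-5):
    # Central differences of the analytic gradient; column j is d(grad)/dx_j
    cols = []
    for j in range(len(x)):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        gp, gm = grad(A, xp), grad(A, xm)
        cols.append([(a - b) / (2 * h) for a, b in zip(gp, gm)])
    return transpose(cols)

# Arbitrary test data (assumed for illustration); Ax is nonzero here.
A = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0], [3.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
x = [0.5, -1.0, 2.0]

grad_err = max(abs(a - b) for a, b in zip(grad(A, x), fd_grad(A, x)))
H, H_fd = hessian(A, x), fd_hessian(A, x)
hess_err = max(abs(H[i][j] - H_fd[i][j]) for i in range(3) for j in range(3))
print("max gradient error:", grad_err)
print("max Hessian error:", hess_err)
```

Both errors should be tiny if the formulas are right. A further check that needs no finite differences: since $f(tx) = t f(x)$ for $t > 0$, the Hessian must satisfy $D^2 f(x)\, x = 0$, and `matvec(H, x)` should be zero up to rounding.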

The only difference from the formula given in the question is that the second term was written incorrectly: with $v = A^T A x$, the question has the scalar $v^T v$ where the dyad (outer product) $v v^T$ should be.
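The same correction carries over to the reduced formula in the question's edit. Treating $\nabla f(x)$ as a column vector (an assumption about the question's convention) and using $\nabla f(x) = \frac{A^T A x}{\|Ax\|_2}$, the second term of $D^2 f(x)$ can be rewritten as

$$\frac{1}{\|Ax\|_2^3} A^T A x x^T A^T A = \frac{1}{\|Ax\|_2} \nabla f(x) \nabla f(x)^T,$$

so that, for $Ax \neq 0$,

$$\nabla^2 f(x) = \frac{A^T A - \nabla f(x) \nabla f(x)^T}{\|Ax\|_2}.$$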