I want to calculate the derivative of the spectral norm of a symmetric square matrix $W$:

$$ \frac{\partial}{\partial w_{ij}} \|W\|_2 $$

How should I go about this?


Let $W \in \mathbb{R}^{m\times n}$ and $f(W)=\|W\|_2 = \sigma_1(W)$, where $\sigma_1(W)$ denotes the largest singular value of $W$.

Let $W=U\Sigma V^{T}$ be an SVD of $W$. Then $\|W\|_2=\sigma_1=e_1^{T}\Sigma e_1=e_1^{T}U^{T}(U\Sigma V^{T})V e_1$. Hence, $f(W)=u_1^{T}Wv_1$, where $u_1$ and $v_1$ are the first column vectors of $U$ and $V$, respectively.

Hence, the derivative can be computed as follows. Differentiating $f(W)=u_1^{T}Wv_1$ (where $u_1,v_1$ also vary with $W$, which is legitimate when $\sigma_1>\sigma_2$) gives

$Df(W)(H)= Du_1(H)^{T}Wv_1 + u_1^{T}Hv_1 + u_1^{T}W\,Dv_1(H)$.

Since $Wv_1=\sigma_1 u_1$ and $u_1^{T}W=\sigma_1 v_1^{T}$, and since $u_1,v_1$ are unit vectors (so $Du_1(H)^{T}u_1 = v_1^{T}Dv_1(H) = 0$), the first and last terms vanish. Therefore

$Df(W)(H)= u_1^{T}Hv_1 = \operatorname{trace}(u_1^{T}Hv_1) = \operatorname{trace}(v_1u_1^{T}H)$.

Thus, with respect to the Frobenius inner product $\langle G,H\rangle = \operatorname{trace}(G^{T}H)$, the gradient is $\nabla f(W) = u_1v_1^{T}$.

P.S.: Note that a sufficient condition for differentiability is $\sigma_1 \neq \sigma_2$; otherwise the maximization problem defining the induced $2$-norm has more than one argmax.
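As a numerical sanity check (not part of the argument): for a generic random $W$, a finite-difference directional derivative of $\|W\|_2$ should match $u_1^{T}Hv_1$.

```python
# Check that the directional derivative of the spectral norm at a random W
# in a random direction H matches u_1^T H v_1 (generic W, so sigma_1 > sigma_2).
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
H = rng.standard_normal((4, 3))

U, s, Vt = np.linalg.svd(W)
u1, v1 = U[:, 0], Vt[0, :]            # top left/right singular vectors
analytic = u1 @ H @ v1                # Df(W)(H) = u_1^T H v_1

t = 1e-6                              # central finite difference
numeric = (np.linalg.norm(W + t * H, 2) - np.linalg.norm(W - t * H, 2)) / (2 * t)

print(abs(analytic - numeric))        # tiny (finite-difference error)
```

Note that the sign ambiguity of the SVD is harmless here: flipping $u_1$ flips $v_1$ as well, leaving $u_1^{T}Hv_1$ unchanged.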


Here is an approach based on the implicit function theorem, similar to loup blanc's approach, with a connection to my other answer.

Let $W=U \Sigma V^T$ be an SVD of $W$, where $\Sigma = \operatorname{diag}(\sigma_1,...,\sigma_n)$ with $\sigma_k \ge \sigma_{k+1}$. Then $\|W\|_2 = \sigma_1 = \sqrt{\lambda_{\max}(W^TW)}$.

Let $\eta(x,A) = \det(x I -A^TA)$, then $\eta(x,A)=0$ iff $\sqrt{x}$ is a singular value of $A$. Note that $\eta$ is a polynomial in $x$ and the entries of $A$, hence so are the various derivatives.

Note that ${\partial \det(A) \over \partial A} (\Delta) = \operatorname{tr} ((\operatorname{adj} A) \Delta)$, and if $A$ is invertible, we have ${\partial \det(A) \over \partial A} (\Delta) = \det A \operatorname{tr} (A^{-1} \Delta ) $. The latter form is more convenient to work with.
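This determinant-derivative identity is easy to verify numerically; a minimal sketch for the invertible case:

```python
# Check Jacobi's formula in the invertible case:
# d/dt det(A + t*Delta) |_{t=0} = det(A) * tr(A^{-1} Delta).
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))       # generic A, hence invertible
Delta = rng.standard_normal((4, 4))

analytic = np.linalg.det(A) * np.trace(np.linalg.solve(A, Delta))

t = 1e-6                              # central finite difference
numeric = (np.linalg.det(A + t * Delta) - np.linalg.det(A - t * Delta)) / (2 * t)

print(abs(analytic - numeric))        # tiny (finite-difference error)
```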

If $\sqrt{x}$ is not a singular value of $W$, we have ${ \partial \eta(x, W) \over \partial x} = \det(x I -W^T W) \operatorname{tr} ((x I -W^TW)^{-1} )$. Using the SVD expansion (and continuity), we have ${ \partial \eta(x, W) \over \partial x} = \sum_k \prod_{i\neq k} (x-\sigma_i^2)$ for all $x$.
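The sum-product formula for ${\partial \eta / \partial x}$ can likewise be checked numerically (a sketch with a random non-square $W$, so that the $\sigma_i$ are the singular values of $W$):

```python
# Check: d/dx det(x I - W^T W) = sum_k prod_{i != k} (x - sigma_i^2).
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((5, 3))
n = 3                                          # W^T W is n x n
s2 = np.linalg.svd(W, compute_uv=False) ** 2   # sigma_i^2

x = 0.7                                        # any x; eta is a polynomial in x
analytic = sum(np.prod([x - s2[i] for i in range(n) if i != k]) for k in range(n))

eta = lambda z: np.linalg.det(z * np.eye(n) - W.T @ W)
h = 1e-6
numeric = (eta(x + h) - eta(x - h)) / (2 * h)  # central finite difference

print(abs(analytic - numeric))
```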

Hence we see that if $\sigma_1 > \sigma_2$ (which also implies $\sigma_1 >0$), then ${ \partial \eta(\sigma_1^2, W) \over \partial x}\neq 0$, and the implicit function theorem gives the existence of a differentiable function $\xi$ defined in a neighbourhood of $W$ such that $\xi(W) = \sigma_1^2$ and $\eta(\xi(W'),W') = 0$ for $W'$ near $W$, hence $\sqrt{\xi(W')}$ is a singular value of $W'$, and continuity of the roots of $x \mapsto \eta(x,W')$ as a function of $W'$ shows that $\|W'\|_2 = \sqrt{\xi(W')}$.

To compute the derivative of $\xi$, we need ${ \partial \eta(\sigma_1^2, W) \over \partial A}(H)$. For $x \neq \sigma_1^2$ near $\sigma_1^2$ we have \begin{eqnarray} { \partial \eta(x, W) \over \partial A}(H) &=& -\det(x I -W^T W) \operatorname{tr} ((x I -W^T W)^{-1} (W^TH+H^TW) ) \\ &=& -2 \det(x I -W^T W) \operatorname{tr} ((x I -W^T W)^{-1} W^TH ) \end{eqnarray} and so (expanding using the SVD), \begin{eqnarray} { \partial \xi(W) \over \partial W}(H) &=& \lim_{x \to \sigma_1^2} 2 { \operatorname{tr} ((x I -W^T W)^{-1} W^TH ) \over \operatorname{tr} ((x I -W^T W)^{-1} ) } \\ &=& 2 \lim_{x \to \sigma_1^2} { \operatorname{tr} ((x I -\Sigma^2)^{-1} \Sigma U^T H V ) \over \operatorname{tr} ((x I -\Sigma^2)^{-1} ) } \\ &=& 2 \lim_{x \to \sigma_1^2} { \sum_k { \sigma_k \over x-\sigma_k^2 } [U^THV]_{kk} \over \sum_k { 1 \over x-\sigma_k^2 } } \\ &=& 2 \lim_{x \to \sigma_1^2} { \sum_k (x-\sigma_1^2) { \sigma_k \over x-\sigma_k^2 } [U^THV]_{kk} \over \sum_k (x-\sigma_1^2) { 1 \over x-\sigma_k^2 } } \\ &=& 2 \sigma_1 [U^THV]_{11} \\ &=& 2 \sigma_1 \langle u , H v \rangle \end{eqnarray} where $u = U e_1,v=V e_1$ are left and right singular vectors of $W$ corresponding to the singular value $\sigma_1$.
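As a sanity check on this limit (not part of the proof), the claimed value $2 \sigma_1 \langle u , H v \rangle$ can be compared against a finite difference of $\xi(W)=\sigma_1(W)^2$:

```python
# Check: d/dt sigma_1(W + t*H)^2 |_{t=0} = 2 * sigma_1 * <u, H v>.
import numpy as np

rng = np.random.default_rng(3)
W = rng.standard_normal((4, 4))
H = rng.standard_normal((4, 4))

U, s, Vt = np.linalg.svd(W)
analytic = 2 * s[0] * (U[:, 0] @ H @ Vt[0, :])   # 2 sigma_1 [U^T H V]_{11}

xi = lambda A: np.linalg.svd(A, compute_uv=False)[0] ** 2
t = 1e-6                                          # central finite difference
numeric = (xi(W + t * H) - xi(W - t * H)) / (2 * t)

print(abs(analytic - numeric))
```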

Finally, since $n(W') = \|W'\|_2 = \sqrt{\xi(W')}$ for $W'$ near $W$, we have ${\partial n(W) \over \partial W}(H) = {1 \over 2 \sqrt{ \xi(W)}} {\partial \xi(W) \over \partial W }(H) = \langle u , H v \rangle$, as above.

The above works for any matrix $W$ as long as $\sigma_1 > \sigma_2$.

If $W$ is symmetric, then we can write the spectral decomposition $W=U \Lambda U^T$, so $W^T W = U \Lambda^2 U^T$ and the singular values of $W$ are the $|\lambda_k|$. Hence $\|W\|_2$ is the absolute value of the eigenvalue of largest absolute value, and the condition $\sigma_1 > \sigma_2$ corresponds to requiring $|\lambda_1| > |\lambda_k|$ for $k >1$. In this case the left singular vector for $\sigma_1$ is a unit eigenvector $u$ corresponding to $\lambda_1$, and the right singular vector is $v = \operatorname{sign}(\lambda_1) u$, so ${\partial n(W) \over \partial W}(H) = \operatorname{sign}(\lambda_1) \langle u , H u \rangle$.

We have, of course, ${\partial n(W) \over \partial W_{ij}} = {\partial n(W) \over \partial W}(E_{ij}) = \operatorname{sign}(\lambda_1)\,[u]_i [u]_j$.
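Putting the symmetric case to a numerical test (a sketch; note the $\operatorname{sign}(\lambda_1)$ factor, which matters when the dominant eigenvalue is negative):

```python
# Symmetric case: check entrywise that d||W||_2 / dW_ij = sign(lambda_1) u_i u_j,
# where u is a unit eigenvector for the eigenvalue of largest absolute value.
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((4, 4))
W = (A + A.T) / 2                        # random symmetric matrix

lam, Q = np.linalg.eigh(W)
k = np.argmax(np.abs(lam))               # index of dominant eigenvalue
u = Q[:, k]
grad = np.sign(lam[k]) * np.outer(u, u)  # claimed gradient

t = 1e-6
fd = np.zeros((4, 4))                    # finite-difference gradient, entry by entry
for i in range(4):
    for j in range(4):
        E = np.zeros((4, 4))
        E[i, j] = 1.0
        fd[i, j] = (np.linalg.norm(W + t * E, 2) - np.linalg.norm(W - t * E, 2)) / (2 * t)

print(np.max(np.abs(fd - grad)))         # tiny (finite-difference error)
```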