Gradient of the $ {L}_{2, 1} $ Mixed Norm

This norm seems a bit weird, nonetheless its gradient can be written as $$\eqalign{ \frac{\partial N}{\partial X} &= X\odot\Big[(X\odot X)\,1_n\Big]^{\odot-1/2} \cr }$$ where $1_n$ is the vector with all elements equal to one, and $\odot$ represents the Hadamard (element-wise) product.

Update
Here are the details leading to the above result.

The conventional definition of the $L_{p,q}$ norm of $X\in{\mathbb R}^{m\times n}$ is $$ \|X\|_{p,q} = \left[\sum_{j=1}^n \left( \sum_{i=1}^m |X_{ij}|^p \right)^{q/p}\right]^{1/q} $$ The definition used in this question transposes the (p,q) indices as well as the order of the summations $$ \|X\|_{q,p} = \left[\sum_{i=1}^m \left( \sum_{j=1}^n |X_{ij}|^p \right)^{q/p}\right]^{1/q} $$ nevertheless, this is the defintion we will work with, and for typing convenience we'll simply call it $N$.

We need a few more things before we get started, the Frobenius inner product, the Hadamard (element-wise) product, and Hadamard exponentiation $$\eqalign { A:X &= \operatorname{tr}(A^TX) \cr X^{\odot 2} &= X\odot X \cr }$$ Let $A=\operatorname{abs}(X)$ where the abs function is also element-wise.

A particularly useful relationship between $A$ and $X$ is the following $$\eqalign { X\odot X &= A\odot A \cr X\odot dX &= A\odot dA \cr\cr }$$

Now we're ready to find the gradient $$\eqalign { N^q &= \bigg(A^{\odot p}\,1_n\bigg)^{\odot(q/p)}:1_m \cr\cr q\,N^{q-1}dN &= (q/p)\big(A^{\odot p}\,1_n\big)^{\odot(q/p-1)}\odot\big(pA^{\odot(p-1)}\odot dA\big)\,1_n:1_m \cr\cr N^{q-1}dN &= \big(A^{\odot p}\,1_n\big)^{\odot(q/p-1)}\odot 1_m:\big(A^{\odot(p-1)}\odot dA\big)\,1_n \cr &= \big(A^{\odot p}\,1_n\big)^{\odot(q/p-1)}:\big(A^{\odot(p-1)}\odot dA\big)\,1_n \cr &= \big(A^{\odot p}\,1_n\big)^{\odot(q/p-1)}1_n^T:A^{\odot(p-1)}\odot dA \cr &= \big(A^{\odot p}\,1_n1_n^T\big)^{\odot(q/p-1)}:A^{\odot(p-2)}\odot A\odot dA \cr &= \big(A^{\odot p}\,1_n1_n^T\big)^{\odot(q/p-1)}\odot A^{\odot(p-2)}:X\odot dX \cr &= \big(A^{\odot p}\,1_n1_n^T\big)^{\odot(q/p-1)}\odot A^{\odot(p-2)}\odot X:dX \cr \cr \frac{\partial N}{\partial X} &= N^{(1-q)}\,\big(A^{\odot p}\,1_{n\times n}\big)^{\odot(q/p-1)}\odot A^{\odot(p-2)}\odot X \cr \cr }$$ Substituting the values $(q\!=\!1,\, p\!=\!2)$ yields $$\eqalign { \frac{\partial N}{\partial X} &= \big[A^{\odot 2}\,1_{n\times n}\big]^{\odot-1/2}\odot X \cr &= X\odot \big[(X\odot X)\,1_{n\times n}\big]^{\odot-1/2} \cr\cr }$$ Note that the commutivity (and mutual commutivity) of the Hadamard and Frobenius products were used in several places in the derivation.

I'm also using a non-standard definition for negative Hadamard powers. If an element is non-zero take the indicated power, but if the element is zero leave it as zero.

I worked it out and it turns out to be a matrix of dimensions $m \times n$ such that the $i^{\text{th}}$ row equals $\frac{x_{i,*}}{||x_{i,*}||}$. For my purposes, I need to write this result in terms of $X$ and not its rows, but I don't know how.

For convenience, define the following vector and its associated diagonal matrix. $$\eqalign{ b &= {\rm diag}(XX^T),\quad B &= {\rm Diag}(b) \doteq {\rm Diag}(XX^T) \\ }$$ Then @frank 's gradient for the $L_{1,2}$ norm can be reformulated as $$\eqalign{ \frac{\partial\|X\|_{1,2}}{\partial X} &= X\odot\Big((X\odot X)\,{\tt\large 11}^T\Big)^{\odot-1/2} \\ &= X\odot\Big({\rm diag}(XX^T)\,{\tt\large 1}^T\Big)^{\odot-1/2} \\ &= X\odot\Big(b{\tt\large 1}^T\Big)^{\odot-1/2} \\ &= X\odot\Big(b^{\odot-1/2}\,{\tt\large 1}^T\Big) \\ &= B^{-1/2}X \\ &= {\rm Diag}(XX^T)^{-1/2}X \\ }$$ In this form, it is strikingly similar to the gradient of the Nuclear Norm. $$\eqalign{ \frac{\partial\|X\|_{*}}{\partial X} &= (XX^T)^{-1/2}X \\ \\ }$$ NB: Utilizing the above B-variables affords a simpler way to calculate the gradient. $$\eqalign{ \|X\|_{1,2} &= b^{\odot 1/2}:1 \\ d\,\|X\|_{1,2} &= \tfrac{1}{2}b^{\odot-1/2}\odot db:1 \\ &= \tfrac{1}{2}b^{\odot-1/2}:db \\ &= \tfrac{1}{2}B^{-1/2}:dB \\ &= \tfrac{1}{2}B^{-1/2}:2\,{\rm Diag}({\rm sym}(dX\,X^T)) \\ &=B^{-1/2}X:dX \\ \frac{\partial\|X\|_{1,2}}{\partial X} &=B^{-1/2}X \\ }$$ One more comment. The above norm is row-oriented, but there is an analogous column-oriented norm given by $\|X^T\|_{1,2}$.

Temporarily changing the independent variable to $Y=X^T$, and then back to $X$ yields $$\eqalign{ d\,\|Y\|_{1,2} &= {\rm Diag}(YY^T)^{-1/2}Y:dY \\ &= {\rm Diag}(X^TX)^{-1/2}X^T:dX^T \\ &= X\;{\rm Diag}(X^TX)^{-1/2}:dX \\ \frac{\partial\|X^T\|_{1,2}}{\partial X} &= X\;{\rm Diag}(X^TX)^{-1/2} \\ }$$ Once again, there is a similar expression for the gradient of the Nuclear Norm. $$\eqalign{ \frac{\partial\|X\|_{*}}{\partial X} &= X\,(X^TX)^{-1/2} \\ }$$

Gradient of the $ {L}_{2, 1} $ Mixed Norm

Related

Recent Posts