Understanding the notation of derivatives with respect to a matrix

The initial problem was the following: let $\mathbf A = (a_{ij})_{1\leq i,j \leq n}$ be an arbitrary square matrix with complex entries and let $f(z) = \sum_{m=0}^\infty b_m z^m$ be an entire function. Then $$\frac\partial{\partial a_{ij}} \mathrm{tr}\ f(\mathbf A) = \big(f'(\mathbf A)\big)_{ji}.$$

Using e.g. Notions of Matrix Differentiation, Differential and derivative of the trace of a matrix, and Derivative of the trace of matrix product $(X^TX)^p$, I tried to understand the notions of derivatives with respect to a matrix. So I started with: $$\frac\partial{\partial \mathbf A} \mathrm{tr}\ \mathbf A^p = p\big(\mathbf A^T\big)^{p-1} \tag{$*$}$$ But there seem to be different notions in use. I found at least two that I tried to relate:

Let $\mathbf A$ be an $m \times n$ matrix with columns $\mathbf a_1,\dots,\mathbf a_n$; then $\mathrm{vec}\ \mathbf A = \begin{pmatrix} \mathbf a_1\\ \vdots \\ \mathbf a_n\end{pmatrix}$ is an $mn\times 1$ column vector. And we use Fréchet differentiability, $$f(x+h) = f(x) + \mathrm Df(x)h + r_x(h),$$ where $\mathrm Df(x)$ is the differential, $\mathrm d f(x,h) = \mathrm Df(x)h = \langle \nabla f(x), h\rangle$, and $\mathrm Df(x)^T = \nabla f(x)$ is the gradient. The differential makes sense if the original function is defined on a ball $B(x,r)$ around $x$ with radius $r$, and $x + h \in B(x,r)$. In this notation the differential is $$\mathrm Df(\mathbf A) = \frac{\partial f(\mathbf A)}{\partial(\mathrm{vec}\ \mathbf A)^T}.$$ The differential is linear and obeys the product rule. Since the trace is linear, we get $\mathrm d \ \mathrm{tr}\ f = \mathrm{tr}(\mathrm df)$, where $$\mathrm{tr}(\mathbf A^T \mathbf B) = \sum_{i=1}^m\sum_{j=1}^n a_{ij}b_{ij} = (\mathrm{vec}\ \mathbf A)^T \mathrm{vec}\ \mathbf B.$$
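For concreteness, here is a small numerical sanity check (a sketch in Python/NumPy, not part of the references above) of the identity $\mathrm{tr}(\mathbf A^T \mathbf B) = (\mathrm{vec}\ \mathbf A)^T \mathrm{vec}\ \mathbf B$; note that $\mathrm{vec}$ stacks columns, which corresponds to Fortran-order flattening in NumPy:

```python
import numpy as np

def vec(M):
    # column-stacking vectorization: vec(M) stacks the columns of M
    return M.flatten(order="F")

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((3, 4))

lhs = np.trace(A.T @ B)
rhs = vec(A) @ vec(B)
print(np.isclose(lhs, rhs))  # True
```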

  1. Can we therefore conclude from the formalism that $\mathrm d \ \mathrm{tr}\ f(\mathbf A) = \mathrm{tr}\big(f'(\mathbf A) \ \mathrm d\mathbf A\big)$, since $\mathrm d f(\mathbf A) = f'(\mathbf A)\ \mathrm d\mathbf A$? If we simply use this formula, why do we need the transpose $\mathbf A^T$ of $\mathbf A$ in ($*$)?
  2. How does the notation in 1. (found at Notions of Matrix Differentiation) correspond to the notation I used?

Using the formalism from above we can show that $\mathrm D\,\mathrm{tr}\ \mathbf A^p = p \ \Big(\mathrm{vec}\big((\mathbf A^T)^{p-1}\big)\Big)^T$, since $$\begin{align} \mathrm d\,\mathrm{tr}\ \mathbf A^p &= \mathrm{tr}\ \mathrm d \mathbf A^p\\ &= \mathrm{tr} \big( (\mathrm d \mathbf A)\mathbf A^{p-1} + \mathbf A(\mathrm d\mathbf A)\mathbf A^{p-2}+ \dots + \mathbf A^{p-1}(\mathrm d\mathbf A)\big)\\ &= p \ \mathrm{tr}\big(\mathbf A^{p-1}\,\mathrm d \mathbf A\big) \qquad\text{(linearity and cyclic invariance of the trace)}\\ &= p \ \Big(\mathrm{vec}\big((\mathbf A^T)^{p-1}\big)\Big)^T \mathrm d\,\mathrm{vec}\ \mathbf A, \end{align}$$ where the last step uses $\mathrm{tr}(\mathbf C^T \mathbf B) = (\mathrm{vec}\ \mathbf C)^T \mathrm{vec}\ \mathbf B$ with $\mathbf C = (\mathbf A^T)^{p-1}$ and $\mathbf B = \mathrm d\mathbf A$. Thus we have $$\begin{align} \mathrm d\,\mathrm{tr}\ \mathbf A^p &= p \ \Big(\mathrm{vec}\big((\mathbf A^T)^{p-1}\big)\Big)^T \mathrm d\,\mathrm{vec}\ \mathbf A\\ \mathrm D\,\mathrm{tr}\ \mathbf A^p &= p \ \Big(\mathrm{vec}\big((\mathbf A^T)^{p-1}\big)\Big)^T. \end{align}$$
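A quick finite-difference check of this result (a sketch, assuming all entries of $\mathbf A$ are treated as independent, i.e. no structure is imposed on the matrix):

```python
import numpy as np

def tr_power(A, p):
    return np.trace(np.linalg.matrix_power(A, p))

rng = np.random.default_rng(1)
n, p, eps = 4, 5, 1e-6
A = rng.standard_normal((n, n))          # generic, non-symmetric matrix

# entrywise central finite-difference gradient of tr(A^p)
G_fd = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n)); E[i, j] = eps
        G_fd[i, j] = (tr_power(A + E, p) - tr_power(A - E, p)) / (2 * eps)

G_formula = p * np.linalg.matrix_power(A.T, p - 1)   # p (A^T)^{p-1}
print(np.allclose(G_fd, G_formula, atol=1e-4))       # True
```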

Now an easy example: Let $$\mathbf A = \begin{pmatrix} x & z\\ z & y\end{pmatrix} \qquad \mathbf B = \begin{pmatrix} x & v\\ w & y\end{pmatrix},$$ then $$\mathbf A^2 = \begin{pmatrix} x^2+z^2 & *\\ * & y^2+z^2\end{pmatrix} \qquad \mathbf B^2 = \begin{pmatrix} x^2+vw & *\\ * & y^2+vw\end{pmatrix}$$ (the off-diagonal entries do not matter for the trace), $$\mathrm{tr}\ \mathbf A^2 = x^2+y^2+2z^2 \qquad \mathrm{tr}\ \mathbf B^2 = x^2+y^2+2vw,$$ and hence $$\frac\partial{\partial \mathbf A}\mathrm{tr}\ \mathbf A^2 = \begin{pmatrix} 2x & 4z\\ 4z & 2y\end{pmatrix} \neq 2(\mathbf A^T)^{2-1} \qquad \frac\partial{\partial \mathbf B}\mathrm{tr}\ \mathbf B^2 = \begin{pmatrix} 2x & 2w\\ 2v & 2y\end{pmatrix} = 2(\mathbf B^T)^{2-1}.$$
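The discrepancy for $\mathbf A$ can also be reproduced numerically: when the $(1,2)$ and $(2,1)$ entries are tied to the same variable $z$, a perturbation of $z$ changes both positions at once (a sketch with arbitrarily chosen test values):

```python
import numpy as np

def tr_sq(x, y, z):
    A = np.array([[x, z], [z, y]])       # symmetric parametrization
    return np.trace(A @ A)

x, y, z, eps = 1.0, 2.0, 3.0, 1e-6
dz = (tr_sq(x, y, z + eps) - tr_sq(x, y, z - eps)) / (2 * eps)
print(dz)   # ~ 4*z = 12, not 2*z = 6 as the (1,2) entry of 2*A^T would suggest
```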

  1. Where is the problem? The formula should hold for any square matrix.
  2. Can the initial problem be solved using Einstein/index notation?
  3. Can the initial problem be solved by using that $$\mathrm{tr}\ \mathbf A^p = \sum_{i_1,\dots,i_p=1}^n a_{i_1i_2}\cdots a_{i_{p-1}i_p}a_{i_pi_1}?$$

Solution 1:

Congratulations, you've discovered something very subtle about matrix calculus! In section 2.8 of the Matrix Cookbook, there is a discussion of "Structured Matrices" which addresses situations like this.

Let $G$ denote the gradient as calculated by the trace formula, without regard to any special structure that the matrix may have. Now you wish to enforce a symmetry constraint.

The recipe for the constrained gradient in this case is $$\eqalign{ S &= G + G^T - I\circ G \cr }$$ where $(\circ)$ denotes the Hadamard (aka elementwise) product.
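Continuing the $2\times 2$ example from the question, this recipe reproduces the hand-computed constrained gradient (a sketch with arbitrary numeric values for $x,y,z$):

```python
import numpy as np

x, y, z = 1.0, 2.0, 3.0
A = np.array([[x, z], [z, y]])

G = 2 * A.T                          # unconstrained gradient of tr(A^2)
S = G + G.T - np.eye(2) * G          # S = G + G^T - I∘G; the Hadamard product with I keeps only the diagonal
print(S)                             # [[2x, 4z], [4z, 2y]], matching the direct computation
```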


Note however that you should still use $G$, and not $S$, to calculate the differential of the function $$\eqalign{ df &= \sum_{i,j} G_{ij}\,dA_{ij} \neq \sum_{i,j} S_{ij}\,dA_{ij} \cr }$$ because the off-diagonal terms will be counted twice by a naive summation.

If you want to use $S$ to calculate the differential, then you must do the sum more carefully $$\eqalign{ df &= \sum_{i\geq j} S_{ij}\,dA_{ij} \cr }$$
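Both summations can be checked side by side (a self-contained sketch, reusing the $\mathrm{tr}\ \mathbf A^2$ example and an arbitrary symmetric perturbation):

```python
import numpy as np

x, y, z = 1.0, 2.0, 3.0
A = np.array([[x, z], [z, y]])
G = 2 * A.T                                  # unconstrained gradient of tr(A^2)
S = G + G.T - np.eye(2) * G                  # constrained gradient

dA = np.array([[0.1, 0.2], [0.2, -0.3]])     # a symmetric perturbation
df_G = np.sum(G * dA)                        # naive sum over all (i,j) with G
df_S = sum(S[i, j] * dA[i, j]                # careful sum over i >= j with S
           for i in range(2) for j in range(2) if i >= j)
print(np.isclose(df_G, df_S))                # True
```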

Update

The paper linked by Albus in the comments proves a very interesting identity.
Any matrix, whether symmetric or not, satisfies the following $$\eqalign{ {\rm vech}\big(X+X^T-I\circ X\big) &= D^T {\rm vec}(X) \\ }$$ where $D$ is the Duplication matrix, which was originally defined to recover the full vectorization of a symmetric matrix from its half-vectorized form $$\eqalign{ {\rm vec}(A) &= D\;{\rm vech}(A) \\ }$$ Using these results, we have three ways of writing the differential of a function. $$\eqalign{ df &= G:dA \qquad&\big({\rm Matrix\,form}\big) \\ &= {\rm vec}(G):{\rm vec}(dA) \qquad&\big({\rm Vec\,form}\big) \\ &= {\rm vech}(S):{\rm vech}(dA) \qquad&\big({\rm Half\,vec\,form}\big) \\ }$$ The last expression is only valid when $A=A^T$; the others are valid for all matrices.
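For small $n$ the duplication matrix can be built explicitly and the identity checked directly (a sketch; the `vec`, `vech`, and `duplication_matrix` helpers are hand-rolled for illustration, not library routines):

```python
import numpy as np

def vec(M):
    return M.flatten(order="F")

def vech(M):
    n = M.shape[0]
    return np.concatenate([M[i:, i] for i in range(n)])   # stack lower-triangular parts of the columns

def duplication_matrix(n):
    # D satisfies vec(A) = D @ vech(A) for every symmetric n x n matrix A
    D = np.zeros((n * n, n * (n + 1) // 2))
    k = 0
    for j in range(n):
        for i in range(j, n):
            T = np.zeros((n, n)); T[i, j] = 1; T[j, i] = 1
            D[:, k] = vec(T)
            k += 1
    return D

rng = np.random.default_rng(2)
n = 3
X = rng.standard_normal((n, n))            # arbitrary, non-symmetric
D = duplication_matrix(n)

lhs = vech(X + X.T - np.eye(n) * X)
rhs = D.T @ vec(X)
print(np.allclose(lhs, rhs))               # True
```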

The derivatives, with respect to the vector of fully independent components, can be calculated in the form of a half-vec and then reshaped into a matrix. $$\eqalign{ g_{s} &= \frac{\partial f}{\partial {\rm vech}(A)} = {\rm vech}(S) \\ S &= {\rm vech}^{-1}\big(g_{s}\big) \\ }$$ The question comes down to terminology: in what sense can $S$ be called the gradient?
It certainly behaves like a gradient in the half-vec space.

NB:   The colon product used above is defined as $$A:B = {\rm Tr}(A^TB) = {\rm Tr}(AB^T)$$ and is applicable to vectors as well as matrices.


Update #2

This update is to answer another question raised in the comments:

Given a function $f=f(A)$, what is the "best" way to calculate the gradient?

IMHO, the best way to perform such analysis is to introduce an unconstrained matrix $X$ and use it to construct the matrix $A$ so as to satisfy any constraints.

For example, the construction for an SPD constraint might be $A = XX^T$
in which case the gradient calculation would be $$\eqalign{ df &= G_a:dA \\ &= G_a:\big(dX\,X^T+X\,dX^T\big) \\ &= \big(G_a+G_a^T\big)\,X:dX \\ G_x = \frac{\partial f}{\partial X} &= \big(G_a+G_a^T\big)\,X \\ }$$ where $G_a$ is a well-known gradient for an arbitrary matrix from a trusted reference.

But now $G_x$ is a gradient which you can use to calculate (via gradient descent, conjugate gradients, etc.) a solution $X=X_s$ to your problem, after which the corresponding constrained matrix can be constructed as $A_s = X_s X_s^T$.
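As an illustration of the $A=XX^T$ recipe (a sketch using $f(A)={\rm Tr}\,A^3$, for which the unconstrained gradient $G_a = 3(A^2)^T$ is known, checked against finite differences in $X$):

```python
import numpy as np

def f(A):
    return np.trace(np.linalg.matrix_power(A, 3))

rng = np.random.default_rng(3)
n, eps = 3, 1e-6
X = rng.standard_normal((n, n))
A = X @ X.T                                   # SPD construction

G_a = 3 * np.linalg.matrix_power(A, 2).T      # unconstrained gradient of tr(A^3)
G_x = (G_a + G_a.T) @ X                       # chain rule through A = X X^T

# finite-difference gradient with respect to X
G_fd = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n)); E[i, j] = eps
        G_fd[i, j] = (f((X + E) @ (X + E).T) - f((X - E) @ (X - E).T)) / (2 * eps)

print(np.allclose(G_x, G_fd, atol=1e-4))      # True
```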

Some other useful constructions are $$\eqalign{ A &= I\circ X \qquad&\big(A{\rm \;is\,diagonal}) \\ A &= P\circ X \qquad&\big(A{\rm \;is\,patterned}) \\ A &= X-X^T \qquad&\big(A{\rm \;is\,skew\,symmetric}) \\ A &= \left(\frac{2I+X-X^T}{2I-X+X^T}\right) \qquad&\big(A{\rm \;is\,orthogonal}) \\ }$$ In the case of a symmetric constraint, you can use the obvious construction $$A=\tfrac{1}{2}(X+X^T) \;\doteq\; {\rm sym}(X)$$ and calculate the gradient as $$\eqalign{ df &= G_a:dA \\ &= G_a:{\rm sym}(dX) \\ &= {\rm sym}(G_a):dX \\ G_x = \frac{\partial f}{\partial X} &= \tfrac{1}{2}\big(G_a+G_a^T\big) \\ }$$ and this is precisely the result of Panda et al.
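The same kind of check works for the symmetric construction $A={\rm sym}(X)$ (a sketch mirroring the previous one, again with $f(A)={\rm Tr}\,A^3$):

```python
import numpy as np

def f_of_X(X):
    A = 0.5 * (X + X.T)                       # A = sym(X)
    return np.trace(np.linalg.matrix_power(A, 3))

rng = np.random.default_rng(4)
n, eps = 3, 1e-6
X = rng.standard_normal((n, n))
A = 0.5 * (X + X.T)

G_a = 3 * np.linalg.matrix_power(A, 2).T      # unconstrained gradient at A
G_x = 0.5 * (G_a + G_a.T)                     # predicted gradient w.r.t. X

G_fd = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n)); E[i, j] = eps
        G_fd[i, j] = (f_of_X(X + E) - f_of_X(X - E)) / (2 * eps)

print(np.allclose(G_x, G_fd, atol=1e-4))      # True
```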

Now consider an alternate construction based on the unconstrained vector $$x = {\rm vech}(A) \quad\iff\quad A={\rm vech}^{-1}(x)$$ whose gradient calculation is $$\eqalign{ df &= G:dA \\ &= {\rm vec}(G):{\rm vec}(dA) \\ &= {\rm vec}(G):D\,dx \\ &= D^T{\rm vec}(G):dx \\ &= {\rm vech}(G+G^T-I\circ G):dx \\ &= {\rm vech}(S):dx \\ g_x = \frac{\partial f}{\partial x} &= {\rm vech}(S) \\ &= E\;{\rm vec}(S) \\ &= E\,(g+Kg-{\rm vec}(I)\circ g) \\ &= E(I+K-Y)\,g \\ G_x &= {\rm vech}^{-1}(g_x) \\ }$$ where $(D,E,K)$ are the (duplication, elimination, commutation) matrices associated with Kronecker products, $\,g={\rm vec}(G),\,$ and $\,Y={\rm Diag}\big({\rm vec}(I)\big).$

This is the gradient that other authors have in mind, although they shouldn't write it as a matrix; instead, they should work with the underlying unconstrained $g_x$ vector.
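Finally, the matrices $(E,K,Y)$ are easy to construct explicitly for small $n$, so the half-vec chain above can be verified numerically as well (a sketch; all helpers are hand-rolled for illustration):

```python
import numpy as np

def vec(M):  return M.flatten(order="F")
def vech(M): return np.concatenate([M[i:, i] for i in range(M.shape[0])])

n = 3
N, M = n * n, n * (n + 1) // 2

# commutation matrix K: K @ vec(A) = vec(A^T)
K = np.zeros((N, N))
for i in range(n):
    for j in range(n):
        K[j * n + i, i * n + j] = 1

# elimination matrix E: vech(A) = E @ vec(A)
E = np.zeros((M, N))
k = 0
for j in range(n):
    for i in range(j, n):
        E[k, j * n + i] = 1
        k += 1

Y = np.diag(vec(np.eye(n)))                  # Y = Diag(vec(I))

rng = np.random.default_rng(5)
G = rng.standard_normal((n, n))
g = vec(G)
S = G + G.T - np.eye(n) * G

print(np.allclose(vech(S), E @ (np.eye(N) + K - Y) @ g))   # True
```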