Taylor expansion of a function of a symmetric matrix

Consider a pair of matrices with elements given by $$\eqalign{ M_{ij} &= \begin{cases} 1 &\text{if }(i=j) \\ \frac{1}{2} & \text{otherwise}\end{cases} \\ W_{ij} &= \begin{cases} 1 &\text{if }(i=j) \\ 2 & \text{otherwise}\end{cases} \\ }$$ which are Hadamard inverses of each other, i.e. $\;M\odot W={\tt1}$, the all-ones matrix.
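For a concrete picture, here is a tiny NumPy sketch (the size $n=4$ is arbitrary) that builds $M$ and $W$ and checks that they are indeed Hadamard inverses:

```python
import numpy as np

n = 4                              # illustrative size
diag = np.eye(n, dtype=bool)
M = np.where(diag, 1.0, 0.5)       # 1 on the diagonal, 1/2 off the diagonal
W = np.where(diag, 1.0, 2.0)       # 1 on the diagonal, 2  off the diagonal

# their Hadamard (elementwise) product is the all-ones matrix
assert np.allclose(M * W, np.ones((n, n)))
```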

Suppose that you have been given a function, and by hard work you have calculated its gradient $G$ and its Taylor expansion $$f(X+dX) \approx f(X) + G:dX$$ where the colon denotes the Frobenius inner product $\;A:B={\rm Tr}(A^TB)$

Everything looks great until someone points out that your problem has a symmetry constraint $$X={\rm Sym}(X)\doteq\tfrac{1}{2}\left(X+X^T\right)$$ The constraint implies that both $X$ and $G$ are symmetric, so you might think the constrained gradient is $$\eqalign{ H &= {\rm Sym}(G) \\ }$$ but this is not correct.

Fortunately, there is a way to calculate $H$ from $G$ $$\eqalign{ H &= W\odot{\rm Sym}(G) = W\odot G \quad\implies\quad G = M\odot H \\ }$$

Substituting this into the Taylor expansion yields $$\eqalign{ f(X) + G:dX &= f(X) + (M\odot H):dX \\ &= f(X) + H:(M\odot dX) \\ &= f(X) + (\sqrt{M}\odot H):(\sqrt{M}\odot dX) \\ }$$ NB: These matrices are symmetric with only $\left(\frac{n(n+1)}{2}\right)$ independent components.
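To see the factor of two show up numerically, here is a minimal sketch. It is my own check, not part of the original argument: it assumes the test function $f(X)={\rm Tr}(X^3)/3$, whose unconstrained gradient at a symmetric point is $G=X^2$, and compares $H=W\odot G$ with finite differences taken over the $\frac{n(n+1)}{2}$ independent entries:

```python
import numpy as np

n = 3
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n))
X = (A + A.T) / 2                       # a symmetric test point

def f(Y):
    return np.trace(Y @ Y @ Y) / 3.0    # test function, Tr(Y^3)/3

G = X @ X                               # unconstrained gradient of f at symmetric X
W = np.where(np.eye(n, dtype=bool), 1.0, 2.0)
H = W * G                               # claimed constrained gradient  H = W ⊙ G

eps = 1e-6
for i in range(n):
    for j in range(i, n):               # loop over the independent entries only
        E = np.zeros((n, n))
        E[i, j] = E[j, i] = 1.0         # symmetric bump of the (i, j) entry
        fd = (f(X + eps * E) - f(X - eps * E)) / (2 * eps)
        assert np.isclose(fd, H[i, j], atol=1e-5)
```

The off-diagonal partials come out as $2G_{ij}$ and the diagonal ones as $G_{ii}$, which is exactly $W\odot G$.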

You might think of the last form of the expansion as the standard inner product after each factor has been scaled elementwise by the square root of the $M$ matrix.

Combining the Frobenius and Hadamard products generates a scalar triple product, i.e. $$A:B\odot C = \sum_i\sum_j A_{ij}B_{ij}C_{ij}$$ The order of the three matrices does not affect the value of this product.
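A quick numerical sanity check of that invariance (three arbitrary $4\times4$ matrices, nothing special about them):

```python
import numpy as np

rng = np.random.default_rng(1)
A, B, C = rng.standard_normal((3, 4, 4))        # three arbitrary 4x4 matrices

t = np.sum(A * B * C)                           # A : B ⊙ C written as one elementwise sum
assert np.isclose(t, np.trace(A.T @ (B * C)))   # Frobenius form, A paired with B ⊙ C
assert np.isclose(t, np.trace(B.T @ (A * C)))   # ... and with the factors reordered
assert np.isclose(t, np.trace(C.T @ (A * B)))
```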

Interestingly, if you had to enforce a skew constraint, i.e. $$X={\rm Skw}(X)\doteq\tfrac{1}{2}\left(X-X^T\right)$$ then the constrained gradient would satisfy your intuition
$$H={\rm Skw}(G)$$ with $\left(\frac{n(n-1)}{2}\right)$ independent components.


I think the key problem is that the differential is simply not defined on such "sets of matrices with dependent components".

If $f:\mathbb{R}^m \rightarrow \mathbb{R}$ is differentiable, then the first order approximation in the direction of $v$ is: $$f(x+v)\approx f(x)+\nabla f(x)\cdot v $$ with the usual dot product: $$\nabla f(x)\cdot v=\sum_i \frac{\partial f}{\partial x_i}\,v_i $$

Now, if $m=n^2$ and you fancy reshaping vectors as square matrices and writing everything in uppercase, this is the same as: $$f(X+V)\approx f(X)+tr(D(X)^\top\, V )$$ where the $ij$ component of the matrix $D(X)$ is $\frac{\partial\, f}{\partial\, X_{ij}}$, because the trace reproduces the usual dot product: $$tr(D(X)^\top\, V ) = \sum_i\sum_j D(X)_{ij}\,V_{ij}=\sum_i\sum_j\frac{\partial\, f}{\partial\, X_{ij}}\,V_{ij}$$
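Just to make that identity tangible, a quick check with arbitrary matrices standing in for $D(X)$ and $V$:

```python
import numpy as np

rng = np.random.default_rng(0)
D, V = rng.standard_normal((2, 3, 3))       # stand-ins for D(X) and V

# tr(Dᵀ V) is exactly the entrywise dot product of D and V
assert np.isclose(np.trace(D.T @ V), np.sum(D * V))
```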

All of this is well known and I have only recalled it to have some notation at hand for the case where the components of $X$ are not "independent". One way to explain the problem in this case is that the domain is no longer $\mathbb{R}^m$ and you have to rewrite the function definition.

I will try to do this rewriting. For instance, let $X=\begin{pmatrix} a& b\\b & c\end{pmatrix}$ and consider your function as $f:\mathbb{R}^3\to\mathbb{R}$, so that $f(X)=f(a,b,c)$ and $\nabla f=\left(\frac{\partial f}{\partial a},\frac{\partial f}{\partial b},\frac{\partial f}{\partial c}\right)$. But now the gradient cannot be cast into a square matrix. If you simply repeat the derivative with respect to $b$ and place it in both off-diagonal slots, then the trace does not recover the dot product but introduces an extra term, as the worked example below shows.
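To see the extra term concretely, take for example $f(X)=tr(X^2)=a^2+2b^2+c^2$ (an example of my own choosing). Then $\nabla f=(2a,\,4b,\,2c)$. Placing the $b$-derivative in both off-diagonal slots gives $D=\begin{pmatrix}2a&4b\\4b&2c\end{pmatrix}$, and for a symmetric perturbation $V=\begin{pmatrix}da&db\\db&dc\end{pmatrix}$ the trace yields $$tr(D^\top V)=2a\,da+8b\,db+2c\,dc,$$ whereas the correct first-order change is $\nabla f\cdot(da,db,dc)=2a\,da+4b\,db+2c\,dc$: the off-diagonal contribution is counted twice. (That factor of $2$ on the off-diagonal entries is exactly what the $W$ matrix in the other answer keeps track of.)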

Another way to see what is happening is to note that not every perturbation $V$ is valid: for $X+V$ to remain symmetric, $V$ itself must be symmetric.

To summarize, you have to introduce a novel concept of differentiation on a set that is not a linear space, because the differential as such is not defined on such weird sets. (Spoiler alert: manifolds)

You can visualize the problem with a simpler example. Consider the function $f: \mathbb{R}^2 \to \mathbb{R}$, $f(x,y)=\frac{1}{2}(x^2+y^2)$. Then the gradient is $\nabla f(x,y)=(x,y)$. But imagine that an external influence forces the points to remain on the circle $\mathcal{S}^1=\{(x,y)\in\mathbb{R}^2:x^2+y^2=1\}$, so the components $x,y$ are not "independent". (You can think of a centripetal force in physics or a constraint in optimization.) Then it is obvious that your function is constant on the circle, so the constrained gradient must vanish, even though $\nabla f(x,y)=(x,y)$ does not.
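To make the vanishing explicit: at a point $(x,y)\in\mathcal{S}^1$ the admissible directions are the tangent ones, spanned by $(-y,x)$, and the directional derivative of $f$ along that direction is $$\nabla f(x,y)\cdot(-y,x)=-xy+yx=0,$$ so the component of the ambient gradient along the circle, which is the only part a constrained notion of gradient should retain, is zero, even though the ambient gradient $(x,y)$ itself is not.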

And then all the differential geometry of manifolds starts...

Edit: Maybe I have not answered your question. You try to put the blame on the dot product, and it is true that you have to think of a way to rewrite the dot product in matrix form. But I think the issue is more fundamental: it is the derivative itself that must be redefined. I am sure B&V know the rigorous formalism, but they tried to keep their text at a more elementary level. BTW, if your topic is optimization, maybe you can have a look at Absil's excellent book, Optimization Algorithms on Matrix Manifolds, but, again, differential geometry is required.