generic rule matrix differentiation (Hadamard Product, element-wise)
I am struggling to take the derivative of an expression involving the Hadamard product.
Let us consider $f(x)=x^TAx=x^T(Ax)$. We know
$$\frac{\partial}{\partial x} x^TAx = (A+A^T)x.$$
The Matrix Cookbook states that $d(XY)=d(X)Y+Xd(Y)$ and $$\frac{\partial}{\partial x} x^Ta = \frac{\partial}{\partial x}a^Tx = a.$$
Setting $X:=x^T$ and $Y:=Ax$ we have \begin{align*} X &= x^TE & d_x(X) &= E\\ Y &= Ax & d_x(Y) &= A \end{align*} This gives $$d(XY)=d(X)Y+Xd(Y)= 1^TAx + x^TA= Ax + x^TA.$$ The dimensions do not match: $Ax$ is a column vector, while $x^TA$ is a row vector. Is the generic rule $d(XY)=d(X)Y+(Xd(Y))^T$ instead?

What does a generic rule look like for the Hadamard product $[a\odot b]_i=[a]_i\cdot [b]_i$? The Matrix Cookbook states: $$d(X\odot Y)=d(X)\odot Y+X\odot d(Y).$$
For example, to differentiate $$ (x\odot y)^TA(x\odot y)$$ with respect to $x$, we have \begin{align*} X &= (x\odot y)^TE & d_x(X) &= \left(d(x\odot y)\right)^TE\\ & & &= \left(1\odot y + x\odot 0\right)^TE\\ & & &= y^TE\\ Y &= A(x\odot y) & d_x(Y) &= A\left(d(x\odot y)\right)\\ & & &=A\left(1\odot y + x\odot 0\right)\\ & & &=Ay \end{align*}
This implies \begin{align*} d_x((x\odot y)^TA(x\odot y))&=y^T\odot A(x\odot y) + \left( (x\odot y)^T Ay \right)^T\\ &=y^T\odot A(x\odot y) + y^TA^T(x\odot y) \end{align*} as the derivative. But the correct derivative should be
$$2y\odot A(x\odot y)$$
What is missing?
Solution 1:
Another episode in the misdeeds of the Matrix Cookbook.
If $f(x)=x^\top Ax$, the derivative is the linear map \begin{align*} Df_x: h\in\mathbb{R}^n\mapsto h^\top Ax+x^\top Ah=(x^\top A^\top+x^\top A)h \end{align*} and the gradient $\nabla f(x)$ is the vector defined, for every $h$, by the relation \begin{align*}Df_x(h)=\langle \nabla f(x),h\rangle={\nabla f(x)}^\top h. \end{align*} Thus $\nabla f(x)=(A+A^\top)x$.
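The identity $\nabla f(x)=(A+A^\top)x$ can be sanity-checked numerically. The following is only a minimal sketch in plain Python; the helper names (`matvec`, `dot`, `numeric_grad`) and the test values for `A` and `x` are illustrative choices, not part of the question.

```python
# Finite-difference check of grad(x^T A x) = (A + A^T) x, standard library only.

def matvec(A, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def dot(u, v):
    """Euclidean scalar product <u, v>."""
    return sum(a * b for a, b in zip(u, v))

def f(A, x):
    """f(x) = x^T A x."""
    return dot(x, matvec(A, x))

def grad_f(A, x):
    """Analytic gradient (A + A^T) x, computed as A x + A^T x."""
    At = [list(col) for col in zip(*A)]
    return [p + q for p, q in zip(matvec(A, x), matvec(At, x))]

def numeric_grad(fun, x, eps=1e-6):
    """Central differences, one coordinate at a time."""
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += eps
        xm[i] -= eps
        g.append((fun(xp) - fun(xm)) / (2 * eps))
    return g

# Deliberately non-symmetric A, so that (A + A^T) x differs from 2 A x.
A = [[1.0, 2.0, 0.5], [-1.0, 3.0, 4.0], [0.0, -2.0, 1.5]]
x = [0.3, -1.2, 2.0]

analytic = grad_f(A, x)
numeric = numeric_grad(lambda z: f(A, z), x)
max_err = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_err)  # small, limited only by floating-point rounding
```

Because $f$ is quadratic, central differences carry no truncation error here, so any visible discrepancy would point to a wrong formula rather than to the step size.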
The Hadamard product $a\odot b$ is bilinear and the derivative satisfies \begin{align*}\mathrm{d}(a\odot b)=\mathrm{d}a\odot b+a \odot \mathrm{d}b \end{align*} like a standard product of matrices. For instance, if $f(x)=(x\odot y)^\top A(x\odot y)$, then \begin{align*} Df_x(h)&=(h\odot y)^\top A(x\odot y)+(x\odot y)^\top A(h\odot y)\\&=[(x\odot y)^\top A+(x\odot y)^\top A^\top](y\odot h) \\&=([(A+A^\top)(x\odot y)]\odot y)^\top h \\&=\langle [(A+A^\top)(x\odot y)]\odot y,h \rangle \end{align*} because $\langle u, v\odot w \rangle=\langle u\odot v, w \rangle$. Hence $\nabla f(x)=[(A+A^\top)(x\odot y)]\odot y$.
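As a rough check of the steps above, the sketch below compares the directional derivative $Df_x(h)$, the pairing $\langle\nabla f(x),h\rangle$, and a finite difference along $h$, and also tests the identity $\langle u, v\odot w\rangle=\langle u\odot v, w\rangle$. It uses plain Python; all helper names and numeric values are assumptions made for illustration.

```python
# Check Df_x(h) = <[(A + A^T)(x o y)] o y, h> for f(x) = (x o y)^T A (x o y).

def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def hadamard(u, v):
    """Element-wise product u o v."""
    return [a * b for a, b in zip(u, v)]

def f(A, x, y):
    z = hadamard(x, y)
    return dot(z, matvec(A, z))

def grad_f(A, x, y):
    """[(A + A^T)(x o y)] o y."""
    z = hadamard(x, y)
    At = [list(col) for col in zip(*A)]
    s = [p + q for p, q in zip(matvec(A, z), matvec(At, z))]
    return hadamard(s, y)

def directional(A, x, y, h):
    """Df_x(h) = (h o y)^T A (x o y) + (x o y)^T A (h o y)."""
    z, w = hadamard(x, y), hadamard(h, y)
    return dot(w, matvec(A, z)) + dot(z, matvec(A, w))

A = [[1.0, 2.0, 0.5], [-1.0, 3.0, 4.0], [0.0, -2.0, 1.5]]  # non-symmetric
x = [0.3, -1.2, 2.0]
y = [2.0, 0.5, -1.0]
h = [0.7, 0.1, -0.4]

lhs = directional(A, x, y, h)      # derivative applied to h
rhs = dot(grad_f(A, x, y), h)      # gradient paired with h

# Finite difference of f along the direction h.
eps = 1e-6
x_plus = [xi + eps * hi for xi, hi in zip(x, h)]
x_minus = [xi - eps * hi for xi, hi in zip(x, h)]
fd = (f(A, x_plus, y) - f(A, x_minus, y)) / (2 * eps)

# Adjoint identity <u, v o w> = <u o v, w> with u = x, v = y, w = h.
adjoint_gap = abs(dot(x, hadamard(y, h)) - dot(hadamard(x, y), h))

print(lhs, rhs, fd, adjoint_gap)
```

The three numbers `lhs`, `rhs` and `fd` should agree up to rounding, which is the numerical counterpart of passing from the derivative (a linear map) to the gradient (a vector) by duality.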
EDIT. Answer to @hans.
Concerning the derivative or the gradient, the standard notation is as follows. Let \begin{align*}f: X=(x,y)\in \Omega\subset \mathbb{R}^p\times\mathbb{R}^q\rightarrow f(X)\in \mathbb{R}^n. \end{align*} Note that $\mathrm{d}f_X, Df_X, {\mathrm{d}f(X)}/{\mathrm{d}X}$ refer to the same concept: the total differential or the total derivative at $X$; it is a linear map $(h,k)\in\mathbb{R}^p\times\mathbb{R}^q\rightarrow \mathbb{R}^n$. In particular, in the formula \begin{align*}\frac{\mathrm{d}f}{\mathrm{d}X}=\frac{\partial f}{\partial x}\mathrm{d}x+\frac{\partial f}{\partial y}\mathrm{d}y, \end{align*} the linear maps $\mathrm{d}x, \mathrm{d}y$ are defined as $\mathrm{d}x:(h,k)\rightarrow h$ and $\mathrm{d}y:(h,k)\rightarrow k$. The "partial derivative" $\partial f(X)/{\partial x}:\mathbb{R}^p\rightarrow\mathbb{R}^n$ is also a linear map.
For the case $n=1$, we can define the gradient of $f$ by duality, using the scalar product $\langle H, K \rangle=H^\top K$ for vectors or $\langle H, K \rangle=\mathrm{trace}(H^\top K)$ for square matrices (cf. the beginning of the post).
With our notation, \begin{align*}Df_X(h,k)=\dfrac{\partial f(X)}{\partial x}h+\dfrac{\partial f(X)}{\partial y}k=\left[\dfrac{\partial f(X)}{\partial x},\dfrac{\partial f(X)}{\partial y}\right][h,k]^\top. \end{align*} Thus, \begin{align*}\nabla(f)(X)=\left[\frac{\partial f(X)}{\partial x}, \frac{\partial f(X)}{\partial y}\right]^\top,\end{align*} that is, the transpose of the Jacobian matrix of $f$.
Of course, the calculation in your post is correct, but you calculate the gradient, not the differential or the derivative.
Solution 2:
Your "known" result in the second example (with the Hadamard products) is wrong.
Define the diagonal matrix $$\eqalign{ M &= {\rm Diag}(y) = M^T \cr Mx &= (x\circ y) \cr x^TM &= (x\circ y)^T \cr }$$
Let's assume $y$ is constant and find the differential of your proposed function $$\eqalign{ f &= (x\circ y)^TA(x\circ y) \cr &= x^TMAMx \cr &= x^TBx \cr }$$ As you state in your first example, the derivative of this function is well-known $$\eqalign{ \frac{\partial f}{\partial x} &= (B^T+B)\,x \cr &= M(A^T+A)\,Mx \cr &= y\circ(A^T+A)\,(x\circ y) \cr }$$

Also, you seem to be confused about the difference between a differential and a derivative. If $y$ is a vector, then $dy$ is also a vector, but $(\frac{dy}{dx})$ is a matrix. It is incorrect to replace differentials with derivatives in an equation; you must properly account for the tensorial character of each term in the equation.
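The diagonal-matrix reformulation can be compared numerically with the Hadamard form, and with the formula $2y\circ A(x\circ y)$ from the question. This is a minimal sketch in plain Python; the function names and test data are hypothetical, chosen only to make the comparison concrete.

```python
# Compare (B^T + B) x with B = Diag(y) A Diag(y) against y o (A^T + A)(x o y),
# and against the formula 2 y o A (x o y) proposed in the question.

def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def hadamard(u, v):
    return [a * b for a, b in zip(u, v)]

def grad_via_diag(A, x, y):
    # B = M A M with M = Diag(y), so B[i][j] = y[i] * A[i][j] * y[j].
    n = len(y)
    B = [[y[i] * A[i][j] * y[j] for j in range(n)] for i in range(n)]
    return [p + q for p, q in zip(matvec(B, x), matvec(transpose(B), x))]

def grad_hadamard(A, x, y):
    # y o (A^T + A)(x o y)
    z = hadamard(x, y)
    s = [p + q for p, q in zip(matvec(A, z), matvec(transpose(A), z))]
    return hadamard(y, s)

def grad_claimed(A, x, y):
    # 2 y o A (x o y), the "known" result from the question.
    return [2 * v for v in hadamard(y, matvec(A, hadamard(x, y)))]

def max_gap(u, v):
    return max(abs(a - b) for a, b in zip(u, v))

A = [[1.0, 2.0, 0.5], [-1.0, 3.0, 4.0], [0.0, -2.0, 1.5]]              # non-symmetric
S = [[a + b for a, b in zip(r, c)] for r, c in zip(A, transpose(A))]    # symmetric: A + A^T
x = [0.3, -1.2, 2.0]
y = [2.0, 0.5, -1.0]

gap_forms = max_gap(grad_via_diag(A, x, y), grad_hadamard(A, x, y))
gap_nonsym = max_gap(grad_claimed(A, x, y), grad_hadamard(A, x, y))
gap_sym = max_gap(grad_claimed(S, x, y), grad_hadamard(S, x, y))
print(gap_forms, gap_nonsym, gap_sym)
```

In this sketch the two correct forms agree up to rounding, while the formula from the question matches only for the symmetric matrix `S`, which suggests that $2y\circ A(x\circ y)$ tacitly assumes $A=A^T$.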