What is the gradient of a matrix product AB?
Solution 1:
$ \def\d{\delta}\def\o{{\tt1}}\def\p{\partial} \def\L{\left}\def\R{\right}\def\LR#1{\L(#1\R)} \def\trace#1{\operatorname{Tr}\LR{#1}} \def\qiq{\quad\implies\quad} \def\grad#1#2{\frac{\p #1}{\p #2}} $Renaming the variables from the first reference from $(Y,X,W)\to(C,A,B)$ makes it comparable to the second reference, i.e. the basic relationship is $$C=AB$$ In the second reference, there is a scalar cost function $z$ which is assumed to be a function of $C$. Furthermore, the gradient wrt $C$ is given by the matrix $$G=\grad{z}{C}$$ It then proposes using the chain rule to calculate the gradient wrt $A$.
However, it is easier to write the differential, then change the independent variable from $C\to A$ $$\eqalign{ dz &= G:dC \\ &= G:\LR{dA\;B} \\ &= GB^T:dA \\ \grad{z}{A} &= GB^T \\ }$$ where $(:)$ denotes the matrix inner product, i.e. $$\eqalign{ X:Y &= \sum_{i=1}^m\sum_{j=1}^n X_{ij}Y_{ij} \;=\; \trace{X^TY} \\ X:X &= \big\|X\big\|^2_F \\ }$$ The properties of the underlying trace function allow the terms in such a product to be rearranged in many different but equivalent ways, e.g. $$\eqalign{ X:Y &= Y:X \\ X:Y &= X^T:Y^T \\ W:XY &= WY^T:X = X^TW:Y \\\\ }$$
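As a quick numerical sanity check of the $\grad{z}{A}=GB^T$ identity, here is a minimal sketch using JAX. The particular cost $z=\|C\|_F^2$ and the matrix shapes are my own choices for illustration; they are not taken from either reference, and any smooth scalar cost of $C$ would work the same way.

```python
import jax
import jax.numpy as jnp

# Illustrative scalar cost (an assumption, not from the references): z = ||C||_F^2 with C = AB.
def z(A, B):
    C = A @ B
    return jnp.sum(C ** 2)

kA, kB = jax.random.split(jax.random.PRNGKey(0))
A = jax.random.normal(kA, (3, 4))
B = jax.random.normal(kB, (4, 5))

# For this particular cost, G = dz/dC = 2C.
G = 2 * (A @ B)

# The autodiff gradient wrt A should equal G B^T.
grad_A = jax.grad(z, argnums=0)(A, B)
print(jnp.allclose(grad_A, G @ B.T))  # True
```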
Now back to the first reference. It describes how to calculate something much more complicated $-$ the matrix-by-matrix gradient $\,\grad{C}{A}$.
Once again, this can be calculated most easily using differentials $$\eqalign{ C &= AB \\ C_{ij} &= \sum_{p=\o}^D A_{ip}\,B_{pj} \\ dC_{ij} &= \sum_{p=\o}^D dA_{ip}\,B_{pj} \\ \grad{C_{ij}}{A_{\ell k}} &= \sum_{p=\o}^D \grad{A_{ip}}{A_{\ell k}}\;B_{pj} \\ &= \sum_{p=\o}^D \d_{i\ell}\,\d_{pk}\,B_{pj} \\ &= \d_{i\ell}\,B_{kj} \\ }$$ The PDF then sets $\ell=i$ to evaluate the remaining Kronecker delta symbol as $\o$; however, leaving the delta symbol intact yields a more general (and useful) result.
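This four-index result can also be checked numerically: the full Jacobian produced by automatic differentiation should equal $\d_{i\ell}\,B_{kj}$. A small sketch, again using JAX (my choice of tool, not the PDF's) and arbitrary shapes:

```python
import jax
import jax.numpy as jnp

kA, kB = jax.random.split(jax.random.PRNGKey(1))
A = jax.random.normal(kA, (3, 4))
B = jax.random.normal(kB, (4, 5))

# Full Jacobian J[i, j, l, k] = dC_ij / dA_lk, with shape (3, 5, 3, 4).
J = jax.jacobian(lambda A: A @ B)(A)

# Predicted form: delta_{il} * B_{kj}.
pred = jnp.einsum('il,kj->ijlk', jnp.eye(3), B)
print(jnp.allclose(J, pred))  # True
```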
As you read more, you will discover that the field of Machine Learning uses a hodge-podge of mathematical notations. Every book or article uses a different approach $-$ and most of them are terrible.