Neural Network matrix calculus

There's indeed some confusion in the literature on which shape to use for the derivative with respect to a matrix. One book that helps make sense of this is "Matrix Differential Calculus with Applications in Statistics and Econometrics" by Jan R. Magnus and Heinz Neudecker, in particular section 9.3, "Bad notation", and section 9.4, "Good notation".

The key point is that there are multiple ways to define derivatives when you start dealing with higher-rank inputs/outputs, with no single one that's significantly better.

There's one special case with an obvious "best" notation: the derivative of a scalar output with respect to a vector or matrix input $W$. In that case it's best to keep the derivative in the same shape as $W$, because this lets you reuse the standard formula for gradient descent:

$$W\leftarrow W-\alpha f'(W)$$
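
For instance, keeping the gradient in the same shape as the parameter is exactly what autodiff frameworks do for scalar losses; here is a minimal PyTorch sketch (the loss and sizes are made up for illustration):

```python
import torch

# hypothetical parameter matrix and a scalar loss built from it
W = torch.randn(3, 5, requires_grad=True)
x = torch.randn(5)
target = torch.randn(3)

loss = ((W @ x) - target).pow(2).sum()   # scalar output
loss.backward()                          # W.grad has the same shape as W (3x5)

alpha = 0.1
with torch.no_grad():
    W -= alpha * W.grad                  # the standard update, no reshaping needed
```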

Personally, I think derivatives of vectors/matrices with respect to a matrix are best kept as tensors. Otherwise you have to flatten a 4-d tensor into a 2-d matrix, there is more than one way of doing that, and you have to think about which convention applies each time. Magnus and Neudecker advocate a particular way of doing this, relying on the vec operation to convert a matrix to a vector. This operation is efficient for column-major memory layout, while modern deep learning systems like PyTorch default to row-major memory layout, so using vec in such a system would give an inefficient representation.
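
To make the ambiguity concrete, here's a small NumPy sketch (array contents are arbitrary) contrasting vec, which stacks columns, with the row-major flatten that C-ordered layouts make cheap:

```python
import numpy as np

A = np.arange(6).reshape(2, 3)      # stored row-major (C order), NumPy's default

vec_A = A.flatten(order='F')        # vec(A): stack columns -> [0, 3, 1, 4, 2, 5]
flat_A = A.flatten()                # row-major flatten     -> [0, 1, 2, 3, 4, 5]

# Same matrix, two different "flattenings": identities written in terms of
# vec() silently assume the column-major one.
assert not np.array_equal(vec_A, flat_A)
```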

To avoid higher-rank tensors or flattening, it's best to use a notational technique called differentials. You can think of a differential as a linear function that approximates your target function, so it's very closely related to the derivative, except that the derivative is a tensor object, whereas the differential is a function.

For instance, for the function $f(x)=Ax+b$, the differential is $df(x)=A\,dx$, whereas the derivative is $A$. For practical purposes it doesn't matter whether your final result is $A$ or $A\,dx$, because you can convert between the two forms.
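
As a quick sanity check of that equivalence (shapes and values here are arbitrary, not from the text): for an affine map, applying the derivative $A$ to an increment $dx$ reproduces $f(x+dx)-f(x)$ exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
b = rng.standard_normal(4)
x = rng.standard_normal(3)
dx = 1e-3 * rng.standard_normal(3)   # an arbitrary small increment

def f(x):
    return A @ x + b

# derivative A applied to the increment dx gives the differential A @ dx
assert np.allclose(f(x + dx) - f(x), A @ dx)
```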

The reason to keep things as differentials is that it lets you do algebra with objects of lower rank. For instance, consider the following matrix-valued function of a matrix $x$:

$$f(x)_{ij}=\sum_{kl} A_{ijkl} x_{kl}$$

The differential of this function has the same matrix shape as $f(x)$, whereas its derivative, $A$, is a rank-4 tensor.
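
In code, the same example looks like this (an einsum over made-up shapes; the index string mirrors the formula): the derivative $A$ is a 4-dimensional array, but the differential is just a matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3, 4, 5))        # rank-4 derivative tensor
X = rng.standard_normal((4, 5))
dX = 1e-3 * rng.standard_normal((4, 5))

def f(X):
    # f(x)_ij = sum_kl A_ijkl x_kl
    return np.einsum('ijkl,kl->ij', A, X)

dF = np.einsum('ijkl,kl->ij', A, dX)         # the differential: a 2x3 matrix
assert np.allclose(f(X + dX) - f(X), dF)     # exact here, since f is linear
```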

Differentials obey a chain rule analogous to the one for derivatives, and you can easily extract the derivative from the final differential.
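
Concretely, the chain rule for differentials (Magnus and Neudecker call the underlying fact Cauchy's rule of invariance) says that if $y=g(W)$, the differential of the composite is obtained by substituting $g(W)$ for $y$ and $dg(W)$ for $dy$ inside $df(y)$:

$$d\big[f(g(W))\big] \;=\; df(y)\Big|_{\,y=g(W),\;dy=dg(W)}$$

This is exactly the substitution carried out in the derivation below.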

Let's get a sense of how a derivation goes in the world of differentials. We are interested in computing the derivative of $f(g(W))$, where $f$ is our loss and $g$ is our predictor. Define the functions $f,g$ and their differentials $df,dg$:

$$\begin{array}{lll} f(y)&=&y^T y\\ df(y)&=&2y^T dy\\ g(W)&=&Wx\\ dg(W)&=&dW x \end{array} $$

Now compute the differential of $f(g(W))$ using the chain rule for differentials:

$$\begin{array}{lll} d[f(g)]&=&df \cdot dg\\ & = & 2y^T\, dW\, x \end{array}$$

Now we can see that the differential is $2y^T dW x$, where $y=Wx$, and we need to extract the derivative from it. The standard trick is to write it as a trace, use the cyclic property of the trace to bring it into the form $\text{tr}(A^T dX)$, and then read the derivative off that form.

$$\begin{array}{lll} 2 y^T dW x&=&\text{tr}(2\, y^T dW\, x) \\ & = & \text{tr}(2\, x y^T dW) \\ & = & \text{tr}(A^T dW) \end{array}$$

In the last step, we defined $A=2\,y x^T$ (so that $A^T = 2\,x y^T$). Once you bring your differential into the form $\text{tr}(A^T dW)$, the derivative is just $A$, and it has the same shape as $W$. So we get that

$$E'=2\,y x^T$$
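
A quick check of this result against automatic differentiation (PyTorch is used here only as an independent verifier; sizes are arbitrary):

```python
import torch

torch.manual_seed(0)
W = torch.randn(3, 4, requires_grad=True)
x = torch.randn(4)

y = W @ x
loss = y @ y                              # f(g(W)) = (Wx)^T (Wx)
loss.backward()

manual = 2 * torch.outer(y.detach(), x)   # derivative from the differential: 2 y x^T
assert torch.allclose(W.grad, manual)     # same shape as W, same values as autograd
```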

To see the power of differentials, consider the ease with which various matrix derivatives are derived in the "Collected matrix derivative results for forward and reverse mode AD" paper. Its dot notation $\dot{A}$ is equivalent to the differential $\mathrm{d}A$. Derivatives of the matrix inverse and the determinant can be obtained in matrix form this way.
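
For example, two such results are $\mathrm{d}(X^{-1}) = -X^{-1}\,\mathrm{d}X\,X^{-1}$ and $\mathrm{d}\log|\det X| = \operatorname{tr}(X^{-1}\,\mathrm{d}X)$; here's a quick finite-difference spot-check in NumPy (matrix size and conditioning chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 4)) + 4 * np.eye(4)   # keep X well-conditioned
dX = 1e-6 * rng.standard_normal((4, 4))
inv = np.linalg.inv

# d(X^{-1}) = -X^{-1} dX X^{-1}
assert np.allclose(inv(X + dX) - inv(X), -inv(X) @ dX @ inv(X), atol=1e-10)

# d log|det X| = tr(X^{-1} dX)
d_logdet = np.linalg.slogdet(X + dX)[1] - np.linalg.slogdet(X)[1]
assert np.isclose(d_logdet, np.trace(inv(X) @ dX), atol=1e-10)
```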

Below is a cheat-sheet of some standard matrix differential calculus results, plus exercises to get comfortable with the notation, taken from "Matrix Differential Calculus with Applications in Statistics and Econometrics" (Magnus, Neudecker).


(cheat-sheet images)


Let $|F|$ denote the determinant of $F$

(cheat-sheet images, continued)


Great answer by Yaroslav, to which I would like to add some comments about the trace function.

The trace has the advantage of being a familiar operation; however, it is a clumsy and confusing notation for doing Matrix Calculus. Inner product notations have far greater pedagogical value.

For example, given a matrix variable, a function of it, and the corresponding gradient
$$\phi = f(X),\qquad G = \frac{\partial \phi}{\partial X}$$
there are a variety of ways to write the differential
$$\begin{array}{lll} d\phi &=& G_{ij}\,dX_{ij} \\ &=& G:dX \\ &=& G\bullet dX \\ &=& \big\langle G,\,dX\big\rangle \\ &=& {\rm trace}\big(G^TdX\big) \\ &=& {\rm trace}\big(G\,dX^T\big) \end{array}$$
But take a closer look at those trace expressions.

  1. Why is there a function invocation?
  2. Why is there a transpose?
  3. Does the transpose operation go on the first or second term?

The cognitive benefits of using infix operators versus functional notation are obvious.
Indeed, it's the reason we use algebraic notation like this $$A + BC$$ instead of this $${\rm sum}\Big(A,\,{\rm prod}(B,C)\Big)$$
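
To tie the notations together, here's a tiny NumPy check (arbitrary values) that the componentwise sum, the inner product, and both trace forms all give the same scalar:

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((3, 5))
dX = rng.standard_normal((3, 5))

inner = np.sum(G * dX)                       # G_ij dX_ij, i.e. G : dX = <G, dX>
assert np.isclose(inner, np.trace(G.T @ dX))
assert np.isclose(inner, np.trace(G @ dX.T))
```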