Gradient of linear scalar field with respect to matrix

I am following the book Mathematics for Machine Learning to study the math necessary to understand machine learning papers. On page 158, the authors list these results about gradients of scalar fields with respect to matrices and vectors:

$$ \frac{\partial \mathbf{a}^T\mathbf{x}}{\partial \mathbf{x}} = \mathbf{a}^T \tag{1} $$

$$ \frac{\partial \mathbf{a}^T\mathbf{X}\mathbf{b}}{\partial \mathbf{X}} = \mathbf{a}\mathbf{b}^T \tag{2} $$

The book follows numerator layout.

I am slightly confused about the dimensions of the results. Assume $\mathbf{a} \in \Bbb R^n$, $\mathbf{X} \in \Bbb R^{n\times m}$, $\mathbf{b} \in \Bbb R^m$, with $m = 1$ in the first equation. Yet the first result has dimension $m \times n$ (with $m = 1$), while the second has dimension $n \times m$. Why is that?
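
To make the shapes concrete (a small numpy sketch; the sizes $n = 3$, $m = 2$ are arbitrary):

```python
import numpy as np

n, m = 3, 2
a, b = np.ones(n), np.ones(m)

print(a.reshape(1, n).shape)  # (1, 3): a^T in (1) is 1 x n, i.e. m x n with m = 1
print(np.outer(a, b).shape)   # (3, 2): a b^T in (2) is n x m
```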


I would say that the answer from Ted is wrong...

In both cases you consider, you have a scalar function $\phi$ that takes either a vector or a matrix as input and returns a scalar. Using the Frobenius inner product (denoted by the colon operator), computing a vector/matrix derivative is nothing more than writing the differential $d\phi$ in a particular form and proceeding by identification.
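
For reference, the colon denotes the Frobenius inner product,

$$ \mathbf{A}:\mathbf{B} = \operatorname{tr}(\mathbf{A}^T\mathbf{B}) = \sum_{i,j} A_{ij}B_{ij}, $$

which reduces to the ordinary dot product $\mathbf{a}:\mathbf{x} = \mathbf{a}^T\mathbf{x}$ when its arguments are vectors.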

For a scalar function of a vector argument, write the differential as $d\phi = \mathbf{a}:d\mathbf{x}$. By identification, $$ \frac{\partial \phi}{\partial \mathbf{x}} = \mathbf{a} $$

For a scalar function of a matrix argument, write the differential as $d\phi = \mathbf{A}:d\mathbf{X}$. By identification, $$ \frac{\partial \phi}{\partial \mathbf{X}} = \mathbf{A} $$
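
The identification step is just the multivariable chain rule written compactly: expanding the differential component-wise gives

$$ d\phi = \sum_{i,j} \frac{\partial \phi}{\partial X_{ij}}\, dX_{ij} = \frac{\partial \phi}{\partial \mathbf{X}} : d\mathbf{X}, $$

so whatever constant array multiplies $d\mathbf{X}$ (or $d\mathbf{x}$, in the vector case with a single index) under the Frobenius product is the gradient.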

So in your first example, $\phi(\mathbf{x}) = \mathbf{a}:\mathbf{x}$, hence $d\phi = \mathbf{a}:d\mathbf{x}$ and the derivative is the vector $\mathbf{a}$.

In the second example, $\phi(\mathbf{X}) = \mathbf{a}:\mathbf{X}\mathbf{b} = \mathbf{a}\mathbf{b}^T:\mathbf{X}$, since $\mathbf{a}:\mathbf{X}\mathbf{b} = \operatorname{tr}(\mathbf{a}^T\mathbf{X}\mathbf{b}) = \operatorname{tr}(\mathbf{b}\mathbf{a}^T\mathbf{X}) = \mathbf{a}\mathbf{b}^T:\mathbf{X}$ by the cyclic property of the trace. Thus $d\phi = \mathbf{a}\mathbf{b}^T:d\mathbf{X}$ and the derivative is the matrix $\mathbf{a}\mathbf{b}^T$.
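
As a sanity check, here is a minimal numpy sketch (sizes, seed, and tolerance are arbitrary choices) that compares a central finite-difference estimate of $\partial(\mathbf{a}^T\mathbf{X}\mathbf{b})/\partial\mathbf{X}$ against $\mathbf{a}\mathbf{b}^T$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
a, b = rng.standard_normal(n), rng.standard_normal(m)
X = rng.standard_normal((n, m))
eps = 1e-6

phi = lambda X: a @ X @ b  # phi(X) = a^T X b, a scalar

# Central finite differences: estimate d(phi)/dX_ij entry by entry
num_grad = np.zeros((n, m))
for i in range(n):
    for j in range(m):
        E = np.zeros((n, m))
        E[i, j] = eps
        num_grad[i, j] = (phi(X + E) - phi(X - E)) / (2 * eps)

print(np.allclose(num_grad, np.outer(a, b), atol=1e-6))  # True: gradient is a b^T
```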