Differentiation with respect to a matrix (residual sum of squares)?
I've never heard of differentiating with respect to a matrix. Let $\mathbf{y}$ be a $N \times 1$ vector, $\mathbf{X}$ be a $N \times p$ matrix, and $\beta$ be a $p \times 1$ vector. Then the residual sum of squares is defined by $$\text{RSS}(\beta) = \left(\mathbf{y}-\mathbf{X}\beta\right)^{T}\left(\mathbf{y}-\mathbf{X}\beta\right)\text{.}$$ The Elements of Statistical Learning, 2nd ed., p. 45, states that when we differentiate this with respect to $\beta$, we get $$\begin{align} &\dfrac{\partial\text{RSS}}{\partial \beta} = -2\mathbf{X}^{T}\left(\mathbf{y}-\mathbf{X}\beta\right) \\ &\dfrac{\partial^2\text{RSS}}{\partial \beta\text{ }\partial \beta^{T}} = 2\mathbf{X}^{T}\mathbf{X}\text{.} \end{align}$$ I mean, I could look at $\mathbf{y}$ and $\mathbf{X}$ as "constants" and $\beta$ as a variable, but it's unclear to me where the $-2$ in $\dfrac{\partial\text{RSS}}{\partial \beta}$ comes from, and why we would use $\beta^T$ for the second partial.
Any textbooks that cover this topic would be appreciated as well.
Side note: this is not homework. Please note that I graduated with an undergrad degree only, so assume that I've seen undergraduate real analysis, abstract algebra, and linear algebra for my pure mathematics background.
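Side note 2: for what it's worth, the two displayed formulas do check out numerically. Here is a quick sketch (numpy, with arbitrary random data, nothing from the book) comparing them against finite differences; my question is about how they are derived, not whether they hold.

```python
# Quick finite-difference check of the two formulas from the book.
# Sketch only: numpy, arbitrary random data, all names are mine.
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 3
X = rng.normal(size=(N, p))
y = rng.normal(size=N)
beta = rng.normal(size=p)

def rss(b):
    r = y - X @ b
    return r @ r                       # (y - Xb)^T (y - Xb)

grad = -2 * X.T @ (y - X @ beta)       # claimed first derivative
hess = 2 * X.T @ X                     # claimed second derivative

eps = 1e-4
E = np.eye(p)
# Central differences for the gradient (exact up to rounding, since RSS is quadratic).
grad_fd = np.array([(rss(beta + eps * e) - rss(beta - eps * e)) / (2 * eps) for e in E])
# Mixed second differences for the Hessian.
hess_fd = np.array([[(rss(beta + eps * (ei + ej)) - rss(beta + eps * ei)
                      - rss(beta + eps * ej) + rss(beta)) / eps**2
                     for ej in E] for ei in E])

print(np.allclose(grad, grad_fd, atol=1e-6))   # True
print(np.allclose(hess, hess_fd, atol=1e-3))   # True
```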
So, what you have here is basically a functional. You're feeding in a matrix ($\mathbf{X}$) and a couple of vectors ($\mathbf{y}$ and $\beta$), and combining them in such a way that the output is just a number. What we need, then, is a functional derivative.
Let $\epsilon > 0$ and $\gamma$ be an arbitrary $p \times 1$ vector, then $$\frac{\partial \text{RSS}}{\partial \beta} \equiv \lim_{\epsilon \to 0} \Big((\epsilon \gamma^T)^{-1}\big(\text{RSS}(\beta + \epsilon \gamma) - \text{RSS}(\beta)\big) \Big). $$
We're adding a small, arbitrary vector to $\beta$ and seeing how that changes $\text{RSS}$; at the end we 'divide' out this arbitrary vector. I've used the transpose because $\beta$ and $\gamma$ enter the original functional by multiplication from the right, so to cancel them from the left we need the transpose. All that is left is to evaluate these expressions.
$$\text{RSS}(\beta+\epsilon\gamma) = \left(\mathbf{y}-\mathbf{X}(\beta+\epsilon\gamma)\right)^{T}\left(\mathbf{y}-\mathbf{X}(\beta+\epsilon\gamma)\right) = \left((\mathbf{y}-\mathbf{X}\beta)^{T}-(\mathbf{X}\epsilon\gamma)^{T}\right)\left((\mathbf{y}-\mathbf{X}\beta)-\mathbf{X}\epsilon\gamma\right) $$ $$= (\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta)-(\mathbf{y}-\mathbf{X}\beta)^{T}\mathbf{X}\epsilon\gamma-(\mathbf{X}\epsilon\gamma)^{T}(\mathbf{y}-\mathbf{X}\beta)+(\mathbf{X}\epsilon\gamma)^{T}\mathbf{X}\epsilon\gamma $$ $$=\text{RSS}(\beta)- \epsilon \big((\mathbf{y}-\mathbf{X}\beta)^{T}\mathbf{X}\gamma+(\mathbf{X}\gamma)^{T}(\mathbf{y}-\mathbf{X}\beta)\big) + \epsilon^2 (\mathbf{X}\gamma)^{T}\mathbf{X}\gamma\text{.} $$ So, $$\frac{\text{RSS}(\beta + \epsilon \gamma) - \text{RSS}(\beta)}{\epsilon \gamma^{T}} = \frac{-\big((\mathbf{y}-\mathbf{X}\beta)^{T}\mathbf{X}\gamma+(\mathbf{X}\gamma)^{T}(\mathbf{y}-\mathbf{X}\beta)\big) + \epsilon (\mathbf{X}\gamma)^{T}\mathbf{X}\gamma}{\gamma^{T}}. $$
The third term then does not survive in the limit, and we are left with $$\frac{-\big(\gamma^{T} \mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)+\big(\gamma^{T} \mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)\big)^{T}\big)}{\gamma^{T}}\text{.} $$
However, since both of these terms are $1 \times 1$ matrices, i.e. scalars, each one equals its own transpose, so the numerator is just $-2\gamma^{T}\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)$. 'Dividing' out the $\gamma^{T}$ leaves $$\frac{\partial \text{RSS}}{\partial \beta} = -2 \mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)\text{.} $$
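To make the 'dividing out $\gamma^{T}$' step concrete, one can check numerically that for any fixed $\gamma$ the difference quotient $\big(\text{RSS}(\beta+\epsilon\gamma)-\text{RSS}(\beta)\big)/\epsilon$ approaches $\gamma^{T}\cdot\big(-2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)\big)$ as $\epsilon \to 0$, with the gap being exactly the $\epsilon\,(\mathbf{X}\gamma)^{T}\mathbf{X}\gamma$ term that drops out in the limit. A minimal sketch (numpy, made-up data; all names are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 40, 4
X = rng.normal(size=(N, p))
y = rng.normal(size=N)
beta = rng.normal(size=p)
gamma = rng.normal(size=p)          # the arbitrary direction vector

def rss(b):
    r = y - X @ b
    return r @ r

grad = -2 * X.T @ (y - X @ beta)    # the closed form derived above

for eps in (1e-1, 1e-3, 1e-5):
    quotient = (rss(beta + eps * gamma) - rss(beta)) / eps
    # quotient -> gamma^T grad as eps -> 0; the gap is the eps*(X gamma)^T(X gamma) term
    print(eps, quotient, gamma @ grad)
```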
Wow, I asked this two years ago!
Since then, I've learned how to interpret the notation for quick computations.
Let $$\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}$$ $$\mathbf{X} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{Np} \end{bmatrix}$$ and $$\beta = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_p \end{bmatrix}\text{.}$$ Then $\mathbf{X}\beta \in \mathbb{R}^N$ and $$\mathbf{X}\beta = \begin{bmatrix} \sum_{j=1}^{p}b_jx_{1j} \\ \sum_{j=1}^{p}b_jx_{2j} \\ \vdots \\ \sum_{j=1}^{p}b_jx_{Nj} \end{bmatrix} \implies \mathbf{y}-\mathbf{X}\beta=\begin{bmatrix} y_1 - \sum_{j=1}^{p}b_jx_{1j} \\ y_2 - \sum_{j=1}^{p}b_jx_{2j} \\ \vdots \\ y_N - \sum_{j=1}^{p}b_jx_{Nj} \end{bmatrix} \text{.}$$ Therefore, $$(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta) = \|\mathbf{y}-\mathbf{X}\beta \|^2 = \sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)^2\text{.} $$ We have, for each $k = 1, \dots, p$, $$\dfrac{\partial \text{RSS}}{\partial b_k} = 2\sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)(-x_{ik}) = -2\sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)x_{ik}\text{.}$$ Then $$\begin{align}\dfrac{\partial \text{RSS}}{\partial \beta} &= \begin{bmatrix} \dfrac{\partial \text{RSS}}{\partial b_1} \\ \dfrac{\partial \text{RSS}}{\partial b_2} \\ \vdots \\ \dfrac{\partial \text{RSS}}{\partial b_p} \end{bmatrix} \\ &= \begin{bmatrix} -2\sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)x_{i1} \\ -2\sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)x_{i2} \\ \vdots \\ -2\sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)x_{ip} \end{bmatrix} \\ &= -2\begin{bmatrix} \sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)x_{i1} \\ \sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)x_{i2} \\ \vdots \\ \sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)x_{ip} \end{bmatrix} \\ &= -2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)\text{.} \end{align}$$ For the second partial, as one might guess: $$\begin{align} \dfrac{\partial \text{RSS}}{\partial \beta^{T}} &= \begin{bmatrix} \dfrac{\partial \text{RSS}}{\partial b_1} & \dfrac{\partial \text{RSS}}{\partial b_2} & \cdots & \dfrac{\partial \text{RSS}}{\partial b_p} \end{bmatrix} \\ &= -2\begin{bmatrix} \sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)x_{i1} & \cdots & \sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)x_{ip} \end{bmatrix} \end{align}$$ Now we "stack" to take the partial with respect to $\beta$: $$\begin{align} \dfrac{\partial^2\text{RSS}}{\partial \beta\text{ }\partial\beta^{T}} &= \dfrac{\partial}{\partial\beta}\left(\dfrac{\partial \text{RSS}}{\partial \beta^{T}} \right) \\ &= \begin{bmatrix} -2\cdot \dfrac{\partial}{\partial b_1}\begin{bmatrix} \sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)x_{i1} & \cdots & \sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)x_{ip} \end{bmatrix} \\ \vdots \\ -2\cdot \dfrac{\partial}{\partial b_p}\begin{bmatrix} \sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)x_{i1} & \cdots & \sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)x_{ip} \end{bmatrix} \end{bmatrix} \\ &= \begin{bmatrix} -2\begin{bmatrix} -\sum_{i=1}^{N}x_{i1}^2 & \cdots & -\sum_{i=1}^{N}x_{i1}x_{ip} \end{bmatrix} \\ \vdots \\ -2\begin{bmatrix} -\sum_{i=1}^{N}x_{i1}x_{ip} & \cdots & -\sum_{i=1}^{N}x_{ip}^2 \end{bmatrix} \end{bmatrix} \\ &= 2\mathbf{X}^{T}\mathbf{X}\text{.} \end{align}$$
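If you want to convince yourself that the component-wise sums above really do collapse to the matrix expressions, here is a short sketch (numpy, arbitrary random data) that builds $\partial\text{RSS}/\partial b_k$ and $\partial^2\text{RSS}/\partial b_k\,\partial b_l$ entry by entry from the double sums and compares them with $-2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)$ and $2\mathbf{X}^{T}\mathbf{X}$:

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 30, 3
X = rng.normal(size=(N, p))
y = rng.normal(size=N)
b = rng.normal(size=p)

# Gradient entry by entry: dRSS/db_k = -2 * sum_i (y_i - sum_j b_j x_ij) * x_ik
grad_entries = np.array([
    -2 * sum((y[i] - sum(b[j] * X[i, j] for j in range(p))) * X[i, k]
             for i in range(N))
    for k in range(p)
])

# Hessian entry by entry: d2RSS/(db_k db_l) = 2 * sum_i x_ik x_il
hess_entries = np.array([[2 * sum(X[i, k] * X[i, l] for i in range(N))
                          for l in range(p)] for k in range(p)])

print(np.allclose(grad_entries, -2 * X.T @ (y - X @ b)))  # True
print(np.allclose(hess_entries, 2 * X.T @ X))             # True
```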