Least squares / residual sum of squares in closed form
In finding the Residual Sum of Squares (RSS), we have:
\begin{equation} \hat{Y} = X^T\hat{\beta} \end{equation}
where the parameter vector $\hat{\beta}$ is used to estimate the output $\hat{Y}$ from the input vector $X$.
\begin{equation} RSS(\beta) = \sum_{i=1}^n (y_i - x_i^T\beta)^2 \end{equation}
which in matrix form would be
\begin{equation} RSS(\beta) = (y - X \beta)^T (y - X \beta) \end{equation}
Differentiating w.r.t. $\beta$ and setting the derivative to zero, we get
\begin{equation} X^T(y - X\beta) = 0 \end{equation}
My question is: how is this last step done? How does differentiating produce that final equation?
Solution 1:
According to Randal J. Barnes, Matrix Differentiation, Prop. 7, if $\alpha=y^TAx$ where $y$ and $x$ are vectors and $A$ is a matrix, we have $$\frac{\partial\alpha}{\partial x}=y^TA\text{ and }\frac{\partial\alpha}{\partial y}=x^TA^T$$ (the proof is very simple). Also, according to his Prop. 8, if $\alpha=x^TAx$ then $$\frac{\partial \alpha}{\partial x}=x^T(A+A^T). $$ Therefore, in Alecos's solution (Solution 2 below), I would rather write $$ \frac{\partial\mathrm{RSS}(\beta)}{\partial\beta}=-y^TX-y^TX+\beta^T\left(X^TX+(X^TX)^T\right), $$ where the last term is indeed $2\beta^TX^TX$ since $X^TX$ is symmetric, i.e. $(X^TX)^T=X^TX$. Setting the derivative to zero and dividing by $-2$ gives the equation $$ (y^T-\beta^TX^T)X=0, $$ which provides the same result as in Alecos's answer if we take the transpose of both sides. I guess Alecos has used a different definition of matrix differentiation than Barnes, but the final result is, of course, correct.
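Both propositions are easy to confirm numerically. Here is a minimal NumPy sketch (the random data and the `num_grad` helper are purely illustrative, not from Barnes) comparing each identity against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.normal(size=(n, n))
x = rng.normal(size=n)
y = rng.normal(size=n)

def num_grad(f, v, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at v."""
    g = np.zeros_like(v)
    for i in range(v.size):
        e = np.zeros_like(v)
        e[i] = eps
        g[i] = (f(v + e) - f(v - e)) / (2 * eps)
    return g

# Prop. 7: d(y^T A x)/dx = y^T A   (compared here as 1-D arrays)
assert np.allclose(num_grad(lambda t: y @ A @ t, x), y @ A)

# Prop. 8: d(x^T A x)/dx = x^T (A + A^T)
assert np.allclose(num_grad(lambda t: t @ A @ t, x), x @ (A + A.T))
print("Props. 7 and 8 confirmed numerically.")
```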
Solution 2:
These are standard multiplication and differentiation rules for matrices.
We have
$$RSS(\beta) = (y - X \beta)^T (y - X \beta) = (y^T - \beta^TX^T)(y - X \beta) \\ =y^Ty-y^TX \beta-\beta^TX^Ty+\beta^TX^TX \beta$$
Then $$\frac {\partial RSS(\beta)}{\partial \beta} = -X^Ty-X^Ty+2X^TX\beta$$
where the last term has this form because the matrix $X^TX$ is symmetric.
So $$\frac {\partial RSS(\beta)}{\partial \beta} =0 \Rightarrow -2X^Ty+2X^TX\beta =0 \Rightarrow -X^Ty+X^TX\beta = 0$$
$$\Rightarrow X^T(-y + X\beta) = 0\Rightarrow X^T(y-X\beta)=0$$
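As a sanity check, here is a minimal NumPy sketch (random data, purely illustrative) that compares the gradient formula above against finite differences and confirms that the least-squares minimizer satisfies $X^T(y-X\beta)=0$:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 50, 3
X = rng.normal(size=(N, p))
y = rng.normal(size=N)
beta = rng.normal(size=p)

def rss(b):
    r = y - X @ b
    return r @ r

# Finite-difference gradient of RSS at an arbitrary beta
eps = 1e-6
fd = np.array([(rss(beta + eps * e) - rss(beta - eps * e)) / (2 * eps)
               for e in np.eye(p)])

# Closed form derived above: -2 X^T y + 2 X^T X beta
assert np.allclose(fd, -2 * X.T @ y + 2 * X.T @ X @ beta)

# The least-squares minimizer satisfies X^T (y - X beta_hat) = 0
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(X.T @ (y - X @ beta_hat), 0)
print("Gradient formula and first-order condition verified.")
```

`np.linalg.lstsq` minimizes $\|y - X\beta\|^2$ directly, so it serves as an independent check on the first-order condition.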
Solution 3:
This is a repeat of my answer here.
Let $$\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}$$ $$\mathbf{X} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{Np} \end{bmatrix}$$ and $$\beta = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_p \end{bmatrix}\text{.}$$ Then $\mathbf{X}\beta \in \mathbb{R}^N$ and $$\mathbf{X}\beta = \begin{bmatrix} \sum_{j=1}^{p}b_jx_{1j} \\ \sum_{j=1}^{p}b_jx_{2j} \\ \vdots \\ \sum_{j=1}^{p}b_jx_{Nj} \end{bmatrix} \implies \mathbf{y}-\mathbf{X}\beta=\begin{bmatrix} y_1 - \sum_{j=1}^{p}b_jx_{1j} \\ y_2 - \sum_{j=1}^{p}b_jx_{2j} \\ \vdots \\ y_N - \sum_{j=1}^{p}b_jx_{Nj} \end{bmatrix} \text{.}$$ Therefore, $$(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta) = \|\mathbf{y}-\mathbf{X}\beta \|^2 = \sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)^2\text{.} $$ We have, for each $k = 1, \dots, p$, $$\dfrac{\partial \text{RSS}}{\partial b_k} = 2\sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)(-x_{ik}) = -2\sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)x_{ik}\text{.}$$ Then $$\begin{align}\dfrac{\partial \text{RSS}}{\partial \beta} &= \begin{bmatrix} \dfrac{\partial \text{RSS}}{\partial b_1} \\ \dfrac{\partial \text{RSS}}{\partial b_2} \\ \vdots \\ \dfrac{\partial \text{RSS}}{\partial b_p} \end{bmatrix} \\ &= \begin{bmatrix} -2\sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)x_{i1} \\ -2\sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)x_{i2} \\ \vdots \\ -2\sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)x_{ip} \end{bmatrix} \\ &= -2\begin{bmatrix} \sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)x_{i1} \\ \sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)x_{i2} \\ \vdots \\ \sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)x_{ip} \end{bmatrix} \\ &= -2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)\text{.} \end{align}$$
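Setting this gradient to zero recovers $\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta) = 0$, the equation asked about. The entrywise sums can also be checked against the matrix form directly; a minimal NumPy sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 6, 3
X = rng.normal(size=(N, p))
y = rng.normal(size=N)
b = rng.normal(size=p)

# dRSS/db_k = -2 * sum_i (y_i - sum_j b_j x_ij) * x_ik, written as plain sums
grad_entrywise = np.array([
    -2 * sum((y[i] - sum(b[j] * X[i, j] for j in range(p))) * X[i, k]
             for i in range(N))
    for k in range(p)
])

# Matrix form from the last line of the derivation: -2 X^T (y - X beta)
assert np.allclose(grad_entrywise, -2 * X.T @ (y - X @ b))
print("Entrywise sums match -2 X^T (y - X beta).")
```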