Derivative of a particular matrix-valued function with respect to a vector
I am reading a section of a book regarding linear regression and came across a derivation that I could not follow.
It starts with a loss function:
$\mathcal{L}(\textbf{w},S) = (\textbf{y}-\textbf{X}\textbf{w})^\top(\textbf{y}-\textbf{X}\textbf{w})$
and then states that "We can seek the optimal $\textbf{w}$ by taking the derivatives of the loss with respect to $\textbf{w}$ and setting them to the zero vector"
$\frac{\partial\mathcal{L}(\textbf{w},S)}{\partial\textbf{w}} = -2\textbf{X}^{\top}\textbf{y} + 2\textbf{X}^\top\textbf{X}\textbf{w} = \textbf{0}$
How is this derivative being calculated? I have no idea how to take the derivative of vector- or matrix-valued functions, especially when the derivative is with respect to a vector. I found a PDF ( http://orion.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf ) that appears to address some of my questions, yet my attempts at taking the derivative of the loss function seem to be missing a transpose and thus do not reduce as nicely as the book's result.
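For reference, a quick numerical sanity check of the quoted expression (a minimal NumPy sketch on small random data, not taken from the book) compares it against a finite-difference approximation of the gradient:

```python
import numpy as np

# Made-up data, only for checking the gradient formula numerically.
rng = np.random.default_rng(0)
n, d = 20, 3
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
w = rng.standard_normal(d)

def loss(w):
    r = y - X @ w
    return r @ r  # (y - Xw)^T (y - Xw)

# Gradient as stated in the book: -2 X^T y + 2 X^T X w
grad_book = -2 * X.T @ y + 2 * X.T @ X @ w

# Central finite differences, one coordinate at a time
eps = 1e-6
grad_fd = np.array([
    (loss(w + eps * np.eye(d)[k]) - loss(w - eps * np.eye(d)[k])) / (2 * eps)
    for k in range(d)
])

print(np.allclose(grad_book, grad_fd, atol=1e-4))  # True
```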
Solution 1:
The definition of the derivative can be found at http://en.wikipedia.org/wiki/Fr%C3%A9chet_derivative.
In this case, the derivative can be computed directly by expanding the function: $$\mathcal{L}(w+\delta,S)= \langle y -X(w+\delta), y -X(w + \delta) \rangle = \mathcal{L}(w,S)+2 \langle y -Xw,-X\delta\rangle+ || X \delta||^2.$$ The second term can be written as $2 \langle -X^T(y -Xw),\delta\rangle = \langle -2X^Ty +2X^TXw,\delta\rangle$, from which it follows that the (Fréchet) derivative is $ \frac{\partial \mathcal{L}(w,S)}{\partial w} = (-2X^Ty +2X^TXw)^T$.
The derivative can also be computed componentwise, but requires more bookkeeping.
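Concretely, the componentwise calculation (a standard expansion, spelled out here for completeness) reads
$$\mathcal{L}(w,S) = \sum_{i}\Big(y_i - \sum_{j} X_{ij} w_j\Big)^2, \qquad \frac{\partial \mathcal{L}}{\partial w_k} = -2\sum_{i} X_{ik}\Big(y_i - \sum_{j} X_{ij} w_j\Big),$$
which is the $k$-th component of $-2X^T(y - Xw) = -2X^Ty + 2X^TXw$.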
The expression you have for the partial is missing a transpose.
Solution 2:
Substituting $\,v = Xw - y\,$ simplifies the loss function to the point that differentiation is trivial.
$$\begin{aligned} \mathcal{L} &= v^T v \\ d\mathcal{L} &= 2v^T\,dv = 2v^T(X\,dw) = (2X^Tv)^T dw \\ \frac{\partial \mathcal{L}}{\partial w} &= 2X^Tv = 2X^T(Xw - y) \end{aligned}$$
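Since $2X^T(Xw - y) = -2X^Ty + 2X^TXw$, this agrees with the book's expression, and setting it to the zero vector gives the usual normal equations:
$$X^TX\,w = X^Ty.$$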