Why is the order of the chain rule all weird in the matrix derivatives behind machine learning?
Suppose I have a neural network as shown in the image.
And to make it simple, I'll set all the activation functions to $f(x)=x$.
So if we set the nodes as Row Vectors, $A_1$ is a $1 \times 2$ matrix, $W_1:2 \times 4, A_2:1 \times 4, W_2:4 \times 3,A_3:1 \times 3, W_3:3 \times 5,A_4:1 \times 5,\space W_4:5 \times 2,\space A_5:1 \times 2$ .
Then, $A_2=A_1W_1,\space A_3=A_2W_2,\space A_4=A_3W_3,\space A_5=A_4W_4$.
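(For concreteness, here is a minimal NumPy sketch of this forward pass; the random data is just a placeholder, and the variable names mirror the notation above.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Weights with the shapes listed above; all activations are f(x) = x.
W1 = rng.standard_normal((2, 4))
W2 = rng.standard_normal((4, 3))
W3 = rng.standard_normal((3, 5))
W4 = rng.standard_normal((5, 2))

A1 = rng.standard_normal((1, 2))   # input, a 1 x 2 row vector
A2 = A1 @ W1                       # 1 x 4
A3 = A2 @ W2                       # 1 x 3
A4 = A3 @ W3                       # 1 x 5
A5 = A4 @ W4                       # 1 x 2
```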
And the error function is, as usual, $E={1 \over 2} \|A_5-R\|^2$, where $R$ is the target value ($T$ is reserved for matrix transposes).
So the partial derivatives of the error with respect to each weight matrix are:
${\partial E \over \partial W_4}={\partial E \over \partial A_5}{\partial A_5 \over \partial W_4}=(A_5-R)A_4$
$A_5-R$ is a $1 \times 2$ Row Vector and $A_4$ is a $1 \times 5$ Row Vector, so for the dimensions to work out, the answer has to be $[(A_5-R)^TA_4]^T=A_4^T(A_5-R)$. OK, I often see people write the chain rule in reverse order, so this is not that hard to understand.
Next, ${\partial E \over \partial W_3}={\partial E \over \partial A_5}{\partial A_5 \over \partial A_4}{\partial A_4 \over \partial W_3}=(A_5-R)W_4A_3$
The dimensions of the factors are: $A_5-R:1 \times 2,\space W_4:5 \times 2,\space A_3:1 \times 3$.
Now we start to see a problem: $W_4$ and $A_3$ can NOT be multiplied together, no matter in which order!
$W_4$ has to be multiplied with $(A_5-R)$ first, either as $(A_5-R)W_4^T$ or $W_4(A_5-R)^T$, and for the final answer to have the shape of $W_3$ ($3 \times 5$), it has to be $A_3^T(A_5-R)W_4^T$.
And if we keep going, we can see that the derivative of the error function, $(A_5-R)$, is ALWAYS the second factor, e.g. ${\partial E \over \partial W_k}=A_k^T(A_{n+1}-R) W_n^T \cdots W_{k+1}^T$, where $n$ is the number of weight layers (here $n=4$, so $A_{n+1}=A_5$).
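(A quick finite-difference sanity check of that pattern, sketched with NumPy and random data, confirms the formula numerically for every layer of this network:)

```python
import numpy as np

rng = np.random.default_rng(1)
Ws = [rng.standard_normal(s) for s in [(2, 4), (4, 3), (3, 5), (5, 2)]]  # W1..W4
A1 = rng.standard_normal((1, 2))
R = rng.standard_normal((1, 2))

def forward(Ws, A1):
    """Activations A1..A5 with identity activation functions."""
    As = [A1]
    for W in Ws:
        As.append(As[-1] @ W)
    return As

def error(Ws, A1, R):
    return 0.5 * np.sum((forward(Ws, A1)[-1] - R) ** 2)

def numerical_grad(k, eps=1e-6):
    """Central finite differences of E with respect to Ws[k]."""
    g = np.zeros_like(Ws[k])
    for idx in np.ndindex(*Ws[k].shape):
        Wp = [W.copy() for W in Ws]
        Wm = [W.copy() for W in Ws]
        Wp[k][idx] += eps
        Wm[k][idx] -= eps
        g[idx] = (error(Wp, A1, R) - error(Wm, A1, R)) / (2 * eps)
    return g

As = forward(Ws, A1)
delta = As[-1] - R                       # A5 - R, a 1 x 2 row vector

for k in range(4):                       # k = 0 -> W1, ..., k = 3 -> W4
    # Closed form from above (1-based indices): A_k^T (A5 - R) W4^T ... W_{k+1}^T
    grad = As[k].T @ delta
    for W in reversed(Ws[k + 1:]):
        grad = grad @ W.T
    print(f"W{k + 1}:", grad.shape, np.allclose(grad, numerical_grad(k)))  # expect True
```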
Even if we set the nodes to be Column Vectors, it produces the same result, just in reverse order (with $(A_5-R)$ being the second-to-last factor).
$A_k$ ends up in the reversed position, the $W_i$'s are all in the original order of the chain rule (just transposed), and the derivative of the error function is even out of place (always the second factor).
Why is that!?
Does it contain some hidden meanings I do not understand and therefore missed?
Or simply I just got it all wrong from the very beginning?
Thank you very much for your help!
The "correct way" to apply chain rule with matrices is to use differentials. Within this framework, it holds $$ dE = \frac{\partial E}{\partial \mathbf{A}_5}: d \mathbf{A}_5 $$ with the colon operator denoting the Frobenius inner product.
Since $\mathbf{A}_5=\mathbf{A}_4 \mathbf{W}_4$, you quickly obtain $$ dE = \mathbf{A}_4^T \frac{\partial E}{\partial \mathbf{A}_5}: d \mathbf{W}_4 $$ Thus by identification $$ \frac{\partial E}{\partial \mathbf{W}_4} = \mathbf{A}_4^T \frac{\partial E}{\partial \mathbf{A}_5} $$ with dimension $5\times 2$ (the size of $\mathbf{W}_4$).
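In case the intermediate step is not obvious: since the other weights (and hence $\mathbf{A}_4$) are held fixed, $d\mathbf{A}_5 = \mathbf{A}_4 \, d\mathbf{W}_4$, and using $\mathbf{A}:\mathbf{B} = \operatorname{tr}(\mathbf{A}^T\mathbf{B})$,
$$ dE = \frac{\partial E}{\partial \mathbf{A}_5} : (\mathbf{A}_4 \, d\mathbf{W}_4) = \operatorname{tr}\left( \left(\frac{\partial E}{\partial \mathbf{A}_5}\right)^{T} \mathbf{A}_4 \, d\mathbf{W}_4 \right) = \operatorname{tr}\left( \left( \mathbf{A}_4^T \frac{\partial E}{\partial \mathbf{A}_5} \right)^{T} d\mathbf{W}_4 \right) = \mathbf{A}_4^T \frac{\partial E}{\partial \mathbf{A}_5} : d\mathbf{W}_4 $$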
The rest is very similar and easy to obtain.
UPDATE
Following the same line of reasoning, $$ \frac{\partial E}{\partial \mathbf{W}_3} = \mathbf{A}_3^T \frac{\partial E}{\partial \mathbf{A}_4} = \mathbf{A}_3^T \left[ \frac{\partial E}{\partial \mathbf{A}_5} \mathbf{W}_4^T \right] $$ with dimension $3\times 5$ (the size of $\mathbf{W}_3$).
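(The general pattern behind these identifications is the backward recursion $\frac{\partial E}{\partial \mathbf{A}_k} = \frac{\partial E}{\partial \mathbf{A}_{k+1}} \mathbf{W}_k^T$, $\frac{\partial E}{\partial \mathbf{W}_k} = \mathbf{A}_k^T \frac{\partial E}{\partial \mathbf{A}_{k+1}}$; here is a minimal NumPy sketch of it, with random data standing in for the network above.)

```python
import numpy as np

rng = np.random.default_rng(2)
Ws = [rng.standard_normal(s) for s in [(2, 4), (4, 3), (3, 5), (5, 2)]]  # W1..W4

A = [rng.standard_normal((1, 2))]        # A1, a 1 x 2 row vector
for W in Ws:                             # forward pass, identity activations
    A.append(A[-1] @ W)                  # A2..A5
R = rng.standard_normal((1, 2))

grads = [None] * 4
dA = A[-1] - R                           # dE/dA5, since E = 0.5 * ||A5 - R||^2
for k in reversed(range(4)):             # layer k has input A[k] and weights Ws[k]
    grads[k] = A[k].T @ dA               # dE/dW = (input activation)^T (dE/d output activation)
    dA = dA @ Ws[k].T                    # dE/d(input activation) = (dE/d output activation) W^T

for k, (W, g) in enumerate(zip(Ws, grads), start=1):
    print(f"dE/dW{k}: {g.shape}", g.shape == W.shape)   # each gradient matches the shape of its W
```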