How can I derive the back propagation formula in a more elegant way?

As much as I love Clifford / geometric algebra, I don't think it is of use here. CA allows you to deal in a basis-free way with multidimensional subspaces as algebraic objects, making various complicated derivations more elegant and transparent. In the case of neural nets, however, we really are dealing with a plain vector of parameters. If $\theta \in \mathbb{R}^D$ is a list of parameters, it is indeed a vector because the sum of two parameter vectors is again a valid parameter vector and a scalar times a parameter vector is also a valid parameter vector. This vector does not naturally possess any additional structure that would make it into a multivector.

However, standard Linear Algebra, as practiced by mathematicians, does allow you to do basis-free computations with vectors. This can indeed make the derivation of backpropagation more transparent.

The essence of backprop

Basically, the backpropagation algorithm is an efficient way to compute the gradient of a function that is the composition of a number of functions: $$ L_\theta(x) = (f_N \circ f_{N-1} \circ \cdots \circ f_1)(x)$$ $L$ is called the loss; it is a chain of functions $f_i : \mathbb{R}^{D_{i-1}} \times \mathbb{R}^{P_i} \rightarrow \mathbb{R}^{D_i}$, where each $f_i$ takes a $P_i$-dimensional parameter vector $\theta_i \in \mathbb{R}^{P_i}$ and the output $f_{i-1} \in \mathbb{R}^{D_{i-1}}$ of the previous layer.
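For concreteness, here is a minimal Python/NumPy sketch of this setup; the affine-plus-tanh layers, the `make_layer` helper, and the dimension choices are purely illustrative assumptions, not part of the question:

```python
import numpy as np

# Illustrative only: each layer f_i takes the previous output and its own
# parameter bundle theta_i and returns the next output (here an affine map
# followed by tanh, an arbitrary choice).
def make_layer(d_in, d_out, rng):
    W = rng.standard_normal((d_out, d_in))
    b = rng.standard_normal(d_out)
    f_i = lambda h, theta: np.tanh(theta[0] @ h + theta[1])
    return f_i, (W, b)

rng = np.random.default_rng(0)
dims = [4, 8, 8, 1]  # D_0, D_1, ..., D_N (D_N = 1 so that L is scalar-valued)
layers, params = zip(*(make_layer(d_in, d_out, rng)
                       for d_in, d_out in zip(dims[:-1], dims[1:])))

def L(x, params):
    """The loss as the composition f_N ∘ ... ∘ f_1 applied to x."""
    h = x
    for f_i, theta_i in zip(layers, params):
        h = f_i(h, theta_i)
    return h
```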

Without choosing a basis, we see (using the total derivative) that the gradient of $L$ wrt the whole parameter vector $\theta$ is

$$ \begin{equation} \begin{aligned} \frac{d}{d \theta} f_N \circ \cdots \circ f_1 (x) \; &= \frac{\partial f_N}{\partial \theta_N} \frac{d \theta_N}{d \theta} + \frac{\partial f_N}{\partial f_{N-1}} \frac{d f_{N-1}}{d \theta} \\ &= \frac{\partial f_N}{\partial \theta_N} \frac{d \theta_N}{d \theta} + \frac{\partial f_N}{\partial f_{N-1}} \frac{\partial f_{N-1}}{\partial \theta_{N-1}} \frac{d \theta_{N-1}}{d\theta} + \frac{\partial f_N}{\partial f_{N-1}} \frac{\partial f_{N-1}}{\partial f_{N-2}} \frac{d f_{N-2}}{d\theta}\\ &= \cdots \end{aligned} \end{equation} $$ Do you see the pattern developing? You get products of the form $\frac{\partial f_N}{\partial f_{N-1}} \frac{\partial f_{N-1}}{\partial f_{N-2}} \cdots \frac{\partial f_{2}}{\partial f_1}$. Each factor in that product requires the output of the previous function (layer), so a dumb algorithm would recompute $f_i$ for each term that needs it. A better approach is to do a "forward pass" where we compute $f_1(x)$, remember it, then use it to compute $f_2(f_1(x))$, remember it, and so on.
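Continuing the illustrative setup above, a minimal sketch of such a forward pass just caches every intermediate output so that no $f_i$ ever has to be recomputed:

```python
def forward(x, layers, params):
    """Forward pass: compute and cache x, f_1(x), f_2(f_1(x)), ... for later reuse."""
    activations = [x]
    for f_i, theta_i in zip(layers, params):
        activations.append(f_i(activations[-1], theta_i))
    return activations  # activations[i] is the output of f_i (activations[0] is x)
```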

Then there is the question of how to compute $\frac{\partial f_N}{\partial f_{N-1}} \frac{\partial f_{N-1}}{\partial f_{N-2}} \cdots \frac{\partial f_{2}}{\partial f_1}$ once we have all the values of $f_i$. Once we choose a basis (and we eventually have to in order to get numerical values), the factors $\frac{\partial f_{i}}{\partial f_{i-1}}$ are matrices, so we're talking about a big matrix multiplication. For example, if a mid-layer maps $\mathbb{R}^{1000}$ to $\mathbb{R}^{5000}$, then $\frac{\partial f_{i}}{\partial f_{i-1}}$ is a $5000 \times 1000$ matrix. The key insight of backprop is that the whole function $L$ has signature $L : \mathbb{R}^{D_0} \rightarrow \mathbb{R}$, so $\frac{\partial f_N}{\partial f_{N-1}}$ is a $1 \times D_{N-1}$ row vector, and it's much better to compute $$ \left(\left(\left( \frac{\partial f_N}{\partial f_{N-1}} \cdots \frac{\partial f_{3}}{\partial f_2} \right) \frac{\partial f_{2}}{\partial f_1} \right) \frac{d f_1}{d \theta}\right) $$ than $$ \left(\frac{\partial f_N}{\partial f_{N-1}} \cdots \left(\frac{\partial f_{3}}{\partial f_2} \left(\frac{\partial f_{2}}{\partial f_1} \frac{d f_1}{d \theta}\right)\right)\right) $$

The reason is that in the first case, we perform a sequence of vector-matrix multiplications: the result of the vector-matrix product $\frac{\partial f_{N}}{\partial f_{N-1}} \frac{\partial f_{N-1}}{\partial f_{N-2}}$ is again a row vector, which we then multiply by $\frac{\partial f_{N-2}}{\partial f_{N-3}}$, and so on. Each such multiplication costs $O(D^2)$. In the second case, one computes matrix-matrix products, each of which costs $O(D^3)$.
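Here is a small NumPy sketch of that associativity argument; the common layer width `D` and the random matrices standing in for the Jacobians $\frac{\partial f_i}{\partial f_{i-1}}$ are assumptions made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 500
g = rng.standard_normal((1, D))    # dL/df_{N-1}: a 1 x D row vector, since L is scalar-valued
jacobians = [rng.standard_normal((D, D)) for _ in range(3)]  # stand-ins for df_i/df_{i-1}

# Backprop order: left-associated, a chain of vector-matrix products, O(D^2) each.
v = g
for J in jacobians:
    v = v @ J

# Naive order: right-associated, matrix-matrix products first, O(D^3) each.
M = jacobians[-1]
for J in reversed(jacobians[:-1]):
    M = J @ M
w = g @ M

print(np.allclose(v, w))  # True: same product, very different cost
```

The first loop never forms anything larger than a row vector, which is exactly the point of the left-to-right grouping.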

So we have established that backprop is the exploitation of associativity. In this light, backprop seems like a trivial algorithm (compute the product in the right order...), but back in the 80s you could write a paper about it and get very famous :)