Reverse mode differentiation vs. forward mode differentiation - where are the benefits?

Solution 1:

The chain rule states that the Jacobian of a composition of operations is the product of the Jacobians of its sub-operations. The difference between forward and reverse differentiation is the order in which we multiply those Jacobians.

In your case you only have two sub-operations: $xy$ and $\sin()$, leading to only one matrix multiplication, so it isn't really instructive. However, let's consider an operation with 3 sub-operations. Take the function
$$ \mathbf{y} = h(g(f(\mathbf{x}))) $$
where $\mathbf{x}$ and $\mathbf{y}$ are vectors of different lengths. We can break this down into
$$ \mathbf{a} = f(\mathbf{x}),~~~~ \mathbf{b} = g(\mathbf{a}),~~~~ \mathbf{y} = h(\mathbf{b}). $$
This gives us the Jacobian
$$ \underbrace{\frac{\partial \mathbf{y}}{\partial \mathbf{x}}}_{|\mathbf{y}|\times|\mathbf{x}|} = \underbrace{\frac{\partial h(\mathbf{b})}{\partial \mathbf{b}}}_{|\mathbf{y}|\times|\mathbf{b}|} \underbrace{\frac{\partial g(\mathbf{a})}{\partial \mathbf{a}}}_{|\mathbf{b}|\times|\mathbf{a}|} \underbrace{\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}}}_{|\mathbf{a}|\times|\mathbf{x}|}, $$
with the size of each matrix noted below it. The time taken to compute each of those intermediate Jacobians is fixed, but the order in which we multiply them changes the number of operations required to do so. Forward differentiation would compute
$$ \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \frac{\partial h(\mathbf{b})}{\partial \mathbf{b}}\left(\frac{\partial g(\mathbf{a})}{\partial \mathbf{a}}\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}}\right), $$
which involves $|\mathbf{x}|\cdot|\mathbf{a}|\cdot|\mathbf{b}|+|\mathbf{x}|\cdot|\mathbf{b}|\cdot|\mathbf{y}|$ multiplications*. In contrast, reverse differentiation would compute
$$ \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \left(\frac{\partial h(\mathbf{b})}{\partial \mathbf{b}}\frac{\partial g(\mathbf{a})}{\partial \mathbf{a}}\right)\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}}, $$
which involves $|\mathbf{y}|\cdot|\mathbf{a}|\cdot|\mathbf{b}|+|\mathbf{y}|\cdot|\mathbf{a}|\cdot|\mathbf{x}|$ multiplications.

Assuming for simplicity that $|\mathbf{a}|=|\mathbf{b}|$, in the case that $|\mathbf{y}|\gt|\mathbf{x}|$, we can see that forward differentiation results in fewer operations. Similarly, if $|\mathbf{y}|\lt|\mathbf{x}|$ then reverse differentiation results in fewer operations.
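To make the counting concrete, here is a minimal Python sketch that evaluates the two cost expressions above; the sizes chosen for $|\mathbf{x}|$, $|\mathbf{a}|$, $|\mathbf{b}|$, $|\mathbf{y}|$ are arbitrary examples, not taken from your question.

```python
def matmul_cost(a, b, c):
    """Scalar multiplications needed to multiply an (a x b) matrix by a (b x c) matrix."""
    return a * b * c

def forward_cost(nx, na, nb, ny):
    # dg/da @ df/dx first: (|b| x |a|) @ (|a| x |x|), then dh/db @ result
    return matmul_cost(nb, na, nx) + matmul_cost(ny, nb, nx)

def reverse_cost(nx, na, nb, ny):
    # dh/db @ dg/da first: (|y| x |b|) @ (|b| x |a|), then result @ df/dx
    return matmul_cost(ny, nb, na) + matmul_cost(ny, na, nx)

# Example: many inputs, few outputs (|x| >> |y|), so reverse is much cheaper.
print(forward_cost(nx=1000, na=100, nb=100, ny=1))  # 10_100_000
print(reverse_cost(nx=1000, na=100, nb=100, ny=1))  #    110_000
```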

Extending this reasoning to a longer chain of Jacobians, the optimal ordering of the matrix multiplications need not correspond to pure forward or pure reverse differentiation: a hybrid scheme can require fewer operations than either, as the sketch below illustrates.
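Finding that optimal ordering is the classic matrix-chain-multiplication problem, which can be solved by dynamic programming. A minimal sketch, with an arbitrary example set of Jacobian shapes (not from your question):

```python
from functools import lru_cache

def min_chain_cost(dims):
    """Cheapest multiplication order for a chain of matrices, where
    matrix i has shape dims[i] x dims[i+1]."""
    @lru_cache(maxsize=None)
    def best(i, j):  # cheapest way to multiply matrices i..j
        if i == j:
            return 0
        return min(best(i, k) + best(k + 1, j) + dims[i] * dims[k + 1] * dims[j + 1]
                   for k in range(i, j))
    return best(0, len(dims) - 2)

# Four Jacobians of shapes 5x50, 50x2, 2x40, 40x5.
# Pure forward (right-to-left) costs 2150 multiplications,
# pure reverse (left-to-right) costs 1900,
# but the hybrid ordering (M1 M2)(M3 M4) costs only 950.
print(min_chain_cost((5, 50, 2, 40, 5)))  # 950
```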


*The number of scalar multiplications required to multiply two matrices of sizes $a\times b$ and $b\times c$ is $a\cdot b\cdot c$.

Solution 2:

An analogy might help. Let $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$ be matrices with dimensions such that $\mathbf{ABC}$ is well defined. There are two obvious ways to compute this product, represented by $(\mathbf{AB})\mathbf{C}$ and $\mathbf{A}(\mathbf{BC})$. Which of those will require fewer multiplications and additions depends on the dimensions of the matrices. For example, if $\mathbf{C}$ has width 1 then the second form will be faster, or at least no slower. It's efficient to multiply by thin or short matrices early.
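A minimal numpy sketch of the analogy, with arbitrary example dimensions: $\mathbf{C}$ has width 1, so $\mathbf{A}(\mathbf{BC})$ only ever forms skinny intermediates, while $(\mathbf{AB})\mathbf{C}$ first builds the full $100\times 80$ product.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 50))
B = rng.standard_normal((50, 80))
C = rng.standard_normal((80, 1))

left_first = (A @ B) @ C   # 100*50*80 + 100*80*1 = 408_000 scalar multiplications
right_first = A @ (B @ C)  #  50*80*1  + 100*50*1 =   9_000 scalar multiplications

print(np.allclose(left_first, right_first))  # True: same product, very different cost
```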

If $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$ correspond to Jacobians of various steps in the computation graph, then I believe $\mathbf{C}$ having width $1$ corresponds to the case when $m=1$.
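To see the same trade-off in an actual autodiff tool: JAX exposes the two accumulation orders as `jax.jacfwd` and `jax.jacrev`. Below is a minimal sketch with an arbitrary many-inputs, few-outputs function (not the one from your question); both calls return the same Jacobian, but reverse mode builds it one output row at a time, which is the cheap direction here.

```python
import jax
import jax.numpy as jnp

def f(x):
    # Arbitrary example: R^1000 -> R^3, i.e. many inputs, few outputs.
    return jnp.stack([jnp.sum(jnp.sin(x)),
                      jnp.sum(x ** 2),
                      jnp.prod(x[:5])])

x = jnp.linspace(0.1, 1.0, 1000)

J_fwd = jax.jacfwd(f)(x)  # forward mode: one pass per input (1000 passes)
J_rev = jax.jacrev(f)(x)  # reverse mode: one pass per output (3 passes)

print(J_fwd.shape, jnp.allclose(J_fwd, J_rev))  # (3, 1000) True
```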