What's the intuition behind the conjugate gradient method?

I have been searching for an intuitive explanation of the conjugate gradient method (as it relates to gradient descent) for at least two years without luck.

I even find articles like "An Introduction to the Conjugate Gradient Method Without the Agonizing Pain" hard to understand.

Intuitively, what does this method do (e.g. geometrically) and why does it outperform gradient descent?


Just giving an intuitive overview here; Michael Zibulevsky's lectures go through it in more technical detail.

Consider the quadratic objective $f(x) = x^T Q x + b^T x$ with $Q$ symmetric positive definite. Geometrically, CG tries to "unwarp" some of the geometry induced by the quadratic form $Q$.

To build intuition, consider $Q \in \mathbb{R}^{2\times 2}$. If $Q = I$, the contours of $f$ are circles. If $Q$ is diagonal, the contours are axis-aligned ellipses, with principal axes along the $x$- and $y$-axes. If $Q$ also has off-diagonal components, there is "correlation" between the $x$ and $y$ directions, so the contours are ellipses whose principal axes are rotated mixtures of the $x$- and $y$-axes, namely the eigenvectors of $Q$.
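
To make that picture concrete, here is a minimal numpy sketch (the matrix entries are arbitrary, chosen just for illustration) showing that the eigenvectors of $Q$ are the principal axes of those elliptical contours:

```python
import numpy as np

# A symmetric positive-definite Q with off-diagonal "correlation" (entries chosen arbitrarily).
Q = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

def f(x):
    # The quadratic objective from above: f(x) = x^T Q x + b^T x
    return x @ Q @ x + b @ x

# The eigenvectors of Q are the principal axes of the elliptical contours of f;
# the semi-axis along an eigenvector scales like 1 / sqrt(eigenvalue).
eigvals, eigvecs = np.linalg.eigh(Q)
print("eigenvalues:", eigvals)
print("principal axes (columns):\n", eigvecs)
print("f at the origin:", f(np.zeros(2)))
```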

Now consider performing coordinate descent in the $Q = I$ or diagonal-$Q$ case. We can reach the optimum in at most $2$ steps by treating the problem as two $1$-dimensional minimization problems, one over each coordinate. We get this property because the $x$- and $y$-axes are orthogonal and they "align" with the geometry of $Q$.
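
A minimal sketch of that two-step coordinate descent for a diagonal $Q$ (assuming exact $1$-dimensional minimization along each axis; the numbers are made up):

```python
import numpy as np

# Diagonal Q: the coordinate axes already align with the geometry of the quadratic.
Q = np.diag([4.0, 1.0])
b = np.array([2.0, -3.0])

def f(x):
    return x @ Q @ x + b @ x

x = np.array([5.0, 5.0])            # arbitrary starting point
for i in range(2):                  # one exact 1-D minimization per coordinate
    # minimize q_ii * x_i^2 + b_i * x_i  =>  x_i = -b_i / (2 q_ii)
    x[i] = -b[i] / (2.0 * Q[i, i])

# Two steps land exactly on the global minimizer, which solves 2 Q x = -b.
print("coordinate descent:", x, " exact:", np.linalg.solve(2.0 * Q, -b))
print("objective value:", f(x))
```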

When $Q$ has off-diagonal components, CG instead finds directions that are orthogonal in $Q$'s geometry ($Q$-conjugate directions, i.e. $d_i^T Q d_j = 0$ for $i \neq j$) so that we can still perform coordinate descent, just along these new axes. In principle we could build such directions with a Gram-Schmidt-style conjugation procedure, but if we generate each new direction by conjugating the current gradient against the previous direction, the update rule simplifies dramatically.
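
Here is a sketch of textbook linear CG on a $2$-dimensional quadratic with off-diagonal terms (the standard recurrence, not tied to any particular source above; the matrix entries are arbitrary). The final check confirms the two search directions are $Q$-conjugate, i.e. orthogonal in $Q$'s geometry:

```python
import numpy as np

# Symmetric positive-definite Q with off-diagonal terms (entries chosen arbitrarily).
Q = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])
H = 2.0 * Q                           # Hessian of f(x) = x^T Q x + b^T x

x = np.zeros(2)
r = -(H @ x + b)                      # residual = negative gradient
d = r.copy()                          # first direction: steepest descent
directions = []

for _ in range(2):                    # at most n = 2 steps for a 2-D quadratic
    Hd = H @ d
    alpha = (r @ r) / (d @ Hd)        # exact line search along d
    x = x + alpha * d
    r_new = r - alpha * Hd            # updated residual
    beta = (r_new @ r_new) / (r @ r)  # makes the next direction conjugate to d
    directions.append(d)
    d = r_new + beta * d
    r = r_new

d0, d1 = directions
print("CG solution:", x, " exact:", np.linalg.solve(H, -b))
print("conjugacy d0^T Q d1 =", d0 @ Q @ d1)   # ~0: orthogonal in Q's geometry
```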

It outperforms gradient descent because conjugacy of the search directions guarantees we never undo progress along a direction we have already minimized over; on an $n$-dimensional quadratic this means CG terminates in at most $n$ steps. This is what gets rid of the "zig-zagging" of GD.
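
For contrast, a sketch of steepest descent with exact line search on an ill-conditioned quadratic (values arbitrary): it zig-zags and typically needs hundreds of steps here, while CG on the same problem would terminate in $2$:

```python
import numpy as np

# An ill-conditioned quadratic: the contours are long, thin ellipses.
Q = np.diag([50.0, 1.0])
b = np.array([1.0, -1.0])
H = 2.0 * Q                            # Hessian of f(x) = x^T Q x + b^T x

x = np.array([5.0, 5.0])
x_star = np.linalg.solve(H, -b)        # exact minimizer

steps = 0
while np.linalg.norm(x - x_star) > 1e-8:
    g = H @ x + b                      # gradient
    alpha = (g @ g) / (g @ H @ g)      # exact line search along -g
    x = x - alpha * g                  # consecutive steps are orthogonal: zig-zag
    steps += 1

print("steepest-descent steps:", steps)   # hundreds here; CG would need only 2
```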

[EDIT: some additional notes here]


Check the full version of

Shewchuk (1994), An Introduction to the Conjugate Gradient Method Without the Agonizing Pain

This PDF is a 64-page document with 40+ figures, full of geometric insight. The version you got is just a 17-page condensed version of the full document, without the figures.