Conjugate Gradient Method and Sparse Systems
What is it about the conjugate gradient method that makes it useful for attacking sparse linear systems?
For any Krylov subspace method, such as conjugate gradients, GMRES, MINRES, BiCG, etc., one essentially needs to provide only a routine that performs the matrix-vector product. Clearly, matrix-vector products with sparse matrices are much cheaper than with dense matrices. In addition, these methods are useful for problems where the matrix does not "physically" exist in assembled form, which makes direct solution methods based on matrix factorisations less attractive.
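A minimal sketch (assuming SciPy is available) of what "providing only the matrix-vector product" means in practice: `scipy.sparse.linalg.cg` accepts either a sparse matrix or a `LinearOperator` that merely wraps the action $v \mapsto Av$, so the matrix never needs to be stored densely. The 1D Laplacian used here is just a convenient SPD test matrix.

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import LinearOperator, cg

n = 1000
# Example SPD matrix: the standard 1D Laplacian (tridiagonal, sparse).
A = diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")

# Matrix-free variant: only the action v -> A @ v is provided to the solver.
A_op = LinearOperator((n, n), matvec=lambda v: A @ v)

b = np.ones(n)
x, info = cg(A_op, b)          # info == 0 signals convergence
print(info, np.linalg.norm(A @ x - b))
```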
However, in order to make iterative methods efficient, one usually also needs to provide a good preconditioner, which can be regarded as an approximation of the action of the inverse of the system matrix/operator. Preconditioners are typically based on incomplete factorisations, sparse approximate inverses, multi-level methods, etc.; the right choice is highly problem-dependent.
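As a hedged illustration of the incomplete-factorisation idea: below, `scipy.sparse.linalg.spilu` is wrapped as the preconditioner `M` passed to CG. For an SPD matrix an incomplete Cholesky factorisation would be the natural choice; ILU is used here only because it ships with SciPy, and the drop tolerance and fill factor are arbitrary example values.

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import LinearOperator, cg, spilu

n = 1000
A = diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)

# Incomplete LU factorisation; its solve() approximates the action of A^{-1}.
ilu = spilu(A, drop_tol=1e-4, fill_factor=10)
M = LinearOperator((n, n), matvec=ilu.solve)

x, info = cg(A, b, M=M)
print(info, np.linalg.norm(A @ x - b))
```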
Nevertheless, the goal is to avoid the problems that can arise with direct solution methods. For problems such as those arising from 3D discretisations, direct methods can be expensive because of the memory required to store the fill-in (the factors of a sparse matrix can be much denser than the original matrix). Modern solvers may, however, combine the direct and iterative approaches.
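A quick sketch of fill-in (again assuming SciPy): factorise a 2D five-point Laplacian with a sparse LU and compare the number of nonzeros in $A$ with that of its factors. The grid size is an arbitrary example; in 3D the effect is considerably more pronounced.

```python
import scipy.sparse as sp
from scipy.sparse.linalg import splu

m = 100                                        # 100 x 100 grid -> n = 10000 unknowns
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m))
I = sp.identity(m)
A = (sp.kron(I, T) + sp.kron(T, I)).tocsc()    # 5-point stencil Laplacian

lu = splu(A)
print("nnz(A)        =", A.nnz)
print("nnz(L)+nnz(U) =", lu.L.nnz + lu.U.nnz)  # typically much larger than nnz(A)
```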
Why would steepest descent be significantly worse?
The conjugate gradient (CG) and steepest descent (SD) methods have something in common: both minimise the $A$-norm of the error in $Ax=b$ over a certain subspace. Note that this is equivalent to minimising the "energy" functional $f(x)=\frac{1}{2}(Ax,x)-(b,x)$. In SD, one updates the approximation to $x$ simply in the direction of steepest descent of $f(x)$, that is, in the direction of the residual; SD therefore performs only a one-dimensional minimisation at each step. The implementation of CG does not differ much from that of SD, but its "magic" relies on the fact that two two-term recurrences are (for symmetric positive definite $A$) enough to build an orthogonal basis of residuals spanning the so-called Krylov subspace (the span of $\{r_0,Ar_0,A^2r_0,\ldots\}$). As a result, each new residual is kept orthogonal to the subspace spanned by the previous residuals, and the $A$-norm of the error is minimised over the whole Krylov subspace of dimension $k$ at the $k$-th iteration. Because of this, CG is (in theory) guaranteed to converge to the exact solution $x$ in at most $n$ iterations, where $n$ is the size of the system.
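To make the comparison concrete, here is a minimal plain-NumPy sketch of both iterations following the textbook formulas (no preconditioning; variable names are my own choice, not a reference implementation):

```python
import numpy as np

def steepest_descent(A, b, x0, tol=1e-10, maxit=10000):
    x = x0.copy()
    r = b - A @ x                         # residual = negative gradient of f
    for k in range(maxit):
        if np.linalg.norm(r) < tol:
            break
        alpha = (r @ r) / (r @ (A @ r))   # exact line search along r
        x += alpha * r
        r = b - A @ x
    return x, k

def conjugate_gradient(A, b, x0, tol=1e-10, maxit=10000):
    x = x0.copy()
    r = b - A @ x
    p = r.copy()                          # first search direction
    rs = r @ r
    for k in range(maxit):
        if np.sqrt(rs) < tol:
            break
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p         # A-conjugate to all previous directions
        rs = rs_new
    return x, k
```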
Since SD performs only a one-dimensional minimisation while CG minimises the error over a Krylov subspace whose dimension grows with each iteration, SD can be expected to perform much worse than CG. A well-known example for $n=2$ illustrates how badly SD can perform by tracing its zigzag path to the minimum of $f$, whereas CG converges to the minimum in (at most) two iterations!
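Using the sketch above, one can reproduce the classical $2\times 2$ illustration; the particular matrix and right-hand side below are just an arbitrary ill-conditioned example:

```python
import numpy as np

A = np.array([[1.0,   0.0],
              [0.0, 100.0]])             # condition number 100
b = np.array([1.0, 1.0])
x0 = np.zeros(2)

x_sd, it_sd = steepest_descent(A, b, x0)
x_cg, it_cg = conjugate_gradient(A, b, x0)
print("steepest descent iterations: ", it_sd)   # grows with the condition number
print("conjugate gradient iterations:", it_cg)  # at most 2 for n = 2
```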