Why is the built-in lm function so slow in R?

You are overlooking that

  • solve() only returns your parameters
  • lm() returns you a (very rich) object with many components for subsequent analysis, inference, plots, ...
  • the main cost of your lm() call is not the projection but the resolution of the formula y ~ . from which the model matrix needs to be built.

To illustrate Rcpp we wrote a few variants of a function fastLm() doing more of what lm() does (i.e. a bit more than lm.fit() from base R) and measured it. See e.g. this benchmark script, which clearly shows that the dominant cost for smaller data sets is in parsing the formula and building the model matrix.
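As a rough illustration of that overhead (a minimal sketch with made-up data, not the benchmark script above), you can time the full lm() call against lm.fit() on a prebuilt model matrix:

set.seed(42)
n <- 500; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)
dat <- data.frame(y, X)
system.time(for (i in 1:200) lm(y ~ ., data = dat))   # formula parsing + model matrix + QR fit
system.time(for (i in 1:200) lm.fit(cbind(1, X), y))  # the QR fit alone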

In short, you are doing the Right Thing by benchmarking, but you are not doing it all that correctly: you are trying to compare what is mostly incomparable, namely a subset of a task with the much larger task.


Something wrong with your benchmarking

It is really surprising that no one observed this!

You used t(X) %*% X inside solve(). You should use crossprod(X) instead, as X'X is a symmetric matrix: crossprod() computes only half of the matrix and copies the rest, while %*% forces computing all of it. So crossprod() is roughly twice as fast. This explains why, in your benchmarking, solve() and lm.fit() show roughly the same timing.

On my old Intel Nehalem machine (2008 Intel Core 2 Duo), I have:

X <- matrix(runif(1000*1000),1000)
system.time(t(X)%*%X)
#    user  system elapsed 
#   2.320   0.000   2.079 
system.time(crossprod(X))
#    user  system elapsed 
#    1.22    0.00    0.94 

If your computer is faster, try using X <- matrix(runif(2000*2000),2000) instead.
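Just to confirm that the two forms compute the same matrix, so only the speed differs:

all.equal(crossprod(X), t(X) %*% X)  # TRUE: same X'X, crossprod() merely exploits symmetry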

In the following, I will explain the computational details involved in all fitting methods.


QR factorization vs. Cholesky factorization

lm() / lm.fit() is QR based, while solve() goes through the normal equations, the route conventionally analyzed as the Cholesky method (solve() itself actually uses an LU factorization, but the cost is of the same order). The computational cost of the QR decomposition is about 2 * n * p^2, while the Cholesky method costs about n * p^2 + p^3 / 3 (n * p^2 for computing the matrix cross product, p^3 / 3 for the Cholesky decomposition). So you can see that when n is much, much greater than p, the Cholesky method is about 2 times faster than the QR method, and there is really no need to benchmark it. (In case you don't know: n is the number of observations and p is the number of parameters.)
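As a minimal sketch (made-up data) showing that the two routes agree on the coefficients while doing different amounts of work:

set.seed(1)
n <- 5000; p <- 50
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)
b_chol <- drop(solve(crossprod(X), crossprod(X, y)))  # normal equations route
b_qr   <- lm.fit(X, y)$coefficients                   # QR route used by lm.fit()
all.equal(b_chol, unname(b_qr))                       # TRUE up to rounding error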


LINPACK QR vs. LAPACK QR

lm.fit() uses a (modified) LINPACK QR factorization algorithm rather than a LAPACK QR factorization algorithm. Maybe you are not very familiar with BLAS / LINPACK / LAPACK; these are Fortran libraries providing the kernel routines for scientific matrix computation. LINPACK calls level-1 BLAS, while LAPACK calls level-3 BLAS using block algorithms; on average, LAPACK QR is about 1.6 times faster than LINPACK QR. The critical reason that lm.fit() does not use the LAPACK version is the need for partial column pivoting: the LAPACK version does full column pivoting, which makes it harder for summary.lm() to use the R matrix factor of the QR decomposition to produce the F-statistic and the ANOVA test.
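Base R exposes both routines through qr(), so you can see the difference in pivoting behaviour directly (a small sketch; the pivot vectors, not the timings, are the point here):

X <- matrix(runif(1000 * 50), 1000)
qr_linpack <- qr(X, LAPACK = FALSE)  # (modified) LINPACK dqrdc2, the routine behind lm.fit()
qr_lapack  <- qr(X, LAPACK = TRUE)   # LAPACK dgeqp3 with full column pivoting
qr_linpack$pivot  # 1 2 ... 50 for a well-conditioned X: original column order is kept
qr_lapack$pivot   # generally a genuine permutation of 1:50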


pivoting vs. no pivoting

fastLm() from the RcppEigen package uses the LAPACK non-pivoted QR factorization. Again, you may not be clear about the QR factorization algorithm and the pivoting issue; you only need to know that QR factorization with pivoting has only a 50% share of level-3 BLAS, while QR factorization without pivoting has a 100% share of level-3 BLAS. In this regard, giving up pivoting speeds up the QR factorization. Of course, the final result is different, and when the model matrix is rank deficient, no pivoting gives a dangerous result.
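To see why the pivoting matters, here is a minimal sketch with a rank-deficient model matrix (a duplicated predictor). The pivoted QR behind lm() detects the redundancy and reports an NA coefficient instead of a misleading number; a non-pivoted solver has to return something:

set.seed(1)
x1 <- rnorm(100)
x2 <- x1                       # perfectly collinear with x1: rank-deficient design
y  <- 1 + 2 * x1 + rnorm(100)
coef(lm(y ~ x1 + x2))          # the coefficient on x2 is NA thanks to the pivoted QR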

There is a good question related to the different results you can get from fastLm(): Why does fastLm() return results when I run a regression with one observation? @BenBolker, @DirkEddelbuettel and I had a very brief discussion in the comments under Ben's answer.


Conclusion: Do you want speed, or numerical stability?

In terms of numerical stability, the ordering is:

LINPACK pivoted QR > LAPACK pivoted QR > pivoted Cholesky > LAPACK non-pivoted QR

In terms of speed, the ordering is:

LINPACK pivoted QR < LAPACK pivoted QR < pivoted Cholesky < LAPACK non-pivoted QR

As Dirk said,

FWIW the RcppEigen package has a fuller set of decompositions in its fastLm() example. But here it is as Ben so eloquently stated: "this is part of the price you pay for speed." We give you enough rope to hang yourself. If you want to protect yourself from yourself, stick with lm() or lm.fit(), or create a hybrid 'fast-but-safe' version.


Fast and stable version

Check my answer here.