Why is the built-in lm function so slow in R?
You are overlooking that

- `solve()` only returns your parameters,
- `lm()` returns you a (very rich) object with many components for subsequent analysis, inference, plots, ...
- the main cost of your `lm()` call is not the projection but the resolution of the formula `y ~ .` from which the model matrix needs to be built.
To illustrate Rcpp we wrote a few variants of a function `fastLm()` doing more of what `lm()` does (i.e. a bit more than `lm.fit()` from base R) and measured it. See e.g. this benchmark script, which clearly shows that the dominant cost for smaller data sets is in parsing the formula and building the model matrix.
In short, you are doing the Right Thing by benchmarking, but you are not doing it all that correctly: you are trying to compare what is mostly incomparable, namely a subset of a task with a much larger task.
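The benchmark script linked above is not reproduced here; the following is a minimal sketch of the same comparison (data sizes and the use of the `microbenchmark` package are my own choices, so exact timings will differ), separating the formula / model-matrix overhead from the actual fitting:

```r
## Sketch: how much of lm()'s cost is formula parsing and model-matrix building?
library(microbenchmark)

set.seed(1)
n <- 1000; p <- 10
dat <- data.frame(matrix(rnorm(n * p), n, p), y = rnorm(n))
X <- cbind(Intercept = 1, as.matrix(dat[, 1:p]))   # model matrix built once, up front
y <- dat$y

microbenchmark(
  lm     = lm(y ~ ., data = dat),                  # formula -> model matrix -> QR fit
  lm.fit = lm.fit(X, y),                           # QR fit only, matrix already built
  solve  = solve(crossprod(X), crossprod(X, y)),   # normal-equations solve only
  times  = 100
)
```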
Something wrong with your benchmarking
It is really surprising that no one observed this!
You have used `t(X) %*% X` inside `solve()`. You should use `crossprod(X)` instead, as `X'X` is a symmetric matrix: `crossprod()` only computes half of the matrix and copies the rest, while `%*%` forces computing all of it. So `crossprod()` will be about twice as fast. This explains why, in your benchmarking, you get roughly the same timing for `solve()` and `lm.fit()`.
On my old Intel Nehalem-era machine (2008 Intel Core 2 Duo), I have:
```r
X <- matrix(runif(1000 * 1000), 1000)

system.time(t(X) %*% X)
#   user  system elapsed
#  2.320   0.000   2.079

system.time(crossprod(X))
#   user  system elapsed
#   1.22    0.00    0.94
```
If your computer is faster, try using `X <- matrix(runif(2000*2000), 2000)` instead.
In the following, I will explain the computational details involved in all fitting methods.
QR factorization vs. Cholesky factorization
`lm()` / `lm.fit()` is QR based, while `solve()` is Cholesky based. The computational cost of the QR decomposition is `2 * n * p^2`, while the Cholesky method costs `n * p^2 + p^3` (`n * p^2` for computing the matrix cross product, `p^3` for the Cholesky decomposition). So you can see that when `n` is much, much greater than `p`, the Cholesky method is about twice as fast as the QR method, and there is really no need to benchmark here. (In case you don't know, `n` is the number of observations and `p` is the number of parameters.)
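To see the two routes side by side, here is a rough sketch (the sizes `n` and `p` are arbitrary choices of mine, and the Cholesky route is spelled out explicitly via `chol()`):

```r
## Sketch: QR fit vs. Cholesky fit on a "tall" problem (n >> p),
## where the normal-equations / Cholesky route should be roughly twice as fast.
set.seed(42)
n <- 1e5; p <- 100
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)

## QR route (what lm.fit() does)
system.time(b_qr <- lm.fit(X, y)$coefficients)

## Cholesky route: form X'X (n*p^2 flops), factor it (~p^3), two triangular solves
system.time({
  R    <- chol(crossprod(X))                              # X'X = R'R
  b_ch <- backsolve(R, forwardsolve(t(R), crossprod(X, y)))
})

all.equal(unname(b_qr), drop(b_ch))   # same solution, up to numerical error
```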
LINPACK QR vs. LAPACK QR
Typically, `lm.fit()` uses the (modified) `LINPACK` QR factorization algorithm, rather than the `LAPACK` QR factorization algorithm. Maybe you are not very familiar with `BLAS` / `LINPACK` / `LAPACK`; these are FORTRAN libraries providing the kernels for scientific matrix computations. `LINPACK` calls level-1 `BLAS`, while `LAPACK` calls level-3 `BLAS` using block algorithms. On average, `LAPACK` QR is 1.6 times faster than `LINPACK` QR. The critical reason that `lm.fit()` does not use the `LAPACK` version is the need for partial column pivoting. The `LAPACK` version does full column pivoting, making it more difficult for `summary.lm()` to use the `R` matrix factor of the QR decomposition to produce the F-statistic and the ANOVA test.
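Both factorizations are exposed through base R's `qr()`, so you can measure the gap on your own machine. A quick sketch (the matrix size is an arbitrary choice of mine; `?qr` documents the underlying routines):

```r
## Sketch: modified LINPACK DQRDC2 (default) vs. LAPACK DGEQP3 (LAPACK = TRUE)
set.seed(0)
X <- matrix(rnorm(2000 * 500), 2000, 500)

system.time(qr_linpack <- qr(X))                  # LINPACK, level-1 BLAS
system.time(qr_lapack  <- qr(X, LAPACK = TRUE))   # LAPACK, level-3 BLAS, full column pivoting

qr_lapack$pivot   # the LAPACK result records the column pivot order it used
```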
Pivoting vs. no pivoting
`fastLm()` from the `RcppEigen` package uses the `LAPACK` non-pivoted QR factorization. Again, you may be unclear about the QR factorization algorithm and pivoting issues. You only need to know that QR factorization with pivoting has only a 50% share of level-3 `BLAS`, while QR factorization without pivoting has a 100% share of level-3 `BLAS`. In this regard, giving up pivoting will speed up the QR factorization process. Surely, the final result is different, and when the model matrix is rank deficient, no pivoting gives a dangerous result.
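A small constructed example (my own, not from the question) shows what pivoting buys you when the model matrix is rank deficient:

```r
## Sketch: pivoted QR (lm.fit) flags the aliased column with an NA coefficient;
## blindly solving the normal equations fails on the singular cross product.
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- 2 * x1                       # exactly collinear with x1 -> rank deficient
X  <- cbind(1, x1, x2)
y  <- 1 + x1 + rnorm(n)

lm.fit(X, y)$coefficients                    # pivoted QR: coefficient for x2 is NA
try(solve(crossprod(X), crossprod(X, y)))    # errors: system is singular
```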
There is a good question related to the different result you get from `fastLm()`: Why does `fastLm()` return results when I run a regression with one observation? @BenBolker, @DirkEddelbuettel and I had a very brief discussion in the comments of Ben's answer.
Conclusion: Do you want speed, or numerical stability?
In terms of numerical stability, there is:
LINPACK pivoted QR > LAPACK pivoted QR > pivoted Cholesky > LAPACK non-pivoted QR
In terms of speed, there is:
LINPACK pivoted QR < LAPACK pivoted QR < pivoted Cholesky < LAPACK non-pivoted QR
As Dirk said,

> FWIW the `RcppEigen` package has a fuller set of decompositions in its `fastLm()` example. But here it is as Ben so eloquently stated: "this is part of the price you pay for speed." We give you enough rope to hang yourself. If you want to protect yourself from yourself, stick with `lm()` or `lm.fit()`, or create a hybrid 'fast-but-safe' version.
Fast and stable version
Check my answer here.
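That answer is not reproduced here, but purely as an illustration of what a "fast but safe" hybrid could look like, here is a sketch (the function name `chol_lm` and the fallback behaviour are my own choices, not the linked answer's code) that uses a pivoted Cholesky factorization so that rank deficiency is detected rather than silently mishandled:

```r
## Sketch only: Cholesky on the normal equations, with pivoting so that a
## rank-deficient model matrix is caught instead of producing garbage.
chol_lm <- function(X, y) {
  XtX <- crossprod(X)                 # X'X (symmetric, p x p)
  Xty <- crossprod(X, y)              # X'y
  R   <- chol(XtX, pivot = TRUE)      # pivoted Cholesky: XtX[piv, piv] = R'R
  piv <- attr(R, "pivot")
  if (attr(R, "rank") < ncol(X))
    stop("model matrix is (numerically) rank deficient; fall back to lm() / lm.fit()")
  beta <- numeric(ncol(X))
  beta[piv] <- backsolve(R, forwardsolve(t(R), Xty[piv]))
  beta
}
```

On a full-rank, tall model matrix this reproduces the `lm.fit()` coefficients up to numerical error, while on the collinear `X` from the rank-deficiency example above it stops instead of returning a misleading fit.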