Screening (multi)collinearity in a regression model
The kappa() function can help. Here is a simulated example:
> set.seed(42)
> x1 <- rnorm(100)
> x2 <- rnorm(100)
> x3 <- x1 + 2*x2 + rnorm(100)*0.0001 # so x3 approx a linear comb. of x1+x2
> mm12 <- model.matrix(~ x1 + x2) # normal model, two indep. regressors
> mm123 <- model.matrix(~ x1 + x2 + x3) # bad model with near collinearity
> kappa(mm12) # a 'low' kappa is good
[1] 1.166029
> kappa(mm123) # a 'high' kappa indicates trouble
[1] 121530.7
We can go further and make the third regressor more and more collinear:
> x4 <- x1 + 2*x2 + rnorm(100)*0.000001 # even more collinear
> mm124 <- model.matrix(~ x1 + x2 + x4)
> kappa(mm124)
[1] 13955982
> x5 <- x1 + 2*x2 # now x5 is linear comb of x1,x2
> mm125 <- model.matrix(~ x1 + x2 + x5)
> kappa(mm125)
[1] 1.067568e+16
Note that kappa() computes an approximation of the condition number by default; see help(kappa) for details and for the exact = TRUE option.
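If the approximation is a concern, the exact 2-norm condition number can be requested instead; a minimal sketch, reusing the mm123 matrix from above:

# kappa() approximates the condition number by default;
# exact = TRUE computes it from the singular values (slower for large matrices)
kappa(mm123)               # fast approximation
kappa(mm123, exact = TRUE) # exact 2-norm condition number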
Just to add to what Dirk said about the Condition Number method: a common rule of thumb is that a condition number greater than 30 indicates severe collinearity. Other methods, apart from the condition number, include:
1) The determinant of the correlation matrix, which ranges from 0 (perfect collinearity) to 1 (no collinearity). (The example below uses the covariance matrix instead; its determinant is not confined to [0, 1], but a value near zero likewise signals near-collinearity.)
# using Dirk's example
> det(cov(mm12[,-1]))
[1] 0.8856818
> det(cov(mm123[,-1]))
[1] 8.916092e-09
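For a determinant that is actually bounded between 0 and 1, the correlation matrix can be used; a minimal sketch, reusing Dirk's mm12 and mm123 from above:

# determinant of the correlation matrix lies in [0, 1];
# values near 0 indicate (near-)collinearity among the predictors
det(cor(mm12[, -1]))   # close to 1: x1 and x2 are independent
det(cor(mm123[, -1]))  # close to 0: x3 is nearly x1 + 2*x2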
2) Using the fact that the determinant of a matrix equals the product of its eigenvalues, the presence of one or more very small eigenvalues indicates collinearity:
> eigen(cov(mm12[,-1]))$values
[1] 1.0876357 0.8143184
> eigen(cov(mm123[,-1]))$values
[1] 5.388022e+00 9.862794e-01 1.677819e-09
3) The Variance Inflation Factor (VIF). The VIF for predictor i is 1/(1 - R_i^2), where R_i^2 is the R^2 from a regression of predictor i on the remaining predictors. Collinearity is present when the VIF for at least one independent variable is large; a common rule of thumb is that VIF > 10 is of concern. For an implementation in R see, for example, the vif() function from the car package mentioned below. I would also like to comment that the use of R^2 for detecting collinearity should go hand in hand with visual examination of the scatterplots, because a single outlier can "cause" collinearity where it doesn't exist, or can hide collinearity where it does exist.
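As an illustration, the VIFs can also be computed by hand from the definition above; a minimal sketch using the simulated x1, x2, x3 (the preds data frame and the loop are my own scaffolding, not part of the original answer):

# VIF_i = 1 / (1 - R_i^2), regressing each predictor on the others;
# values above ~10 are usually taken as a warning sign
preds <- data.frame(x1, x2, x3)
vifs <- sapply(names(preds), function(v) {
  r2 <- summary(lm(reformulate(setdiff(names(preds), v), response = v),
                   data = preds))$r.squared
  1 / (1 - r2)
})
vifs  # huge VIFs expected, since x3 is nearly x1 + 2*x2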
You might like Vito Ricci's Reference Card "R Functions For Regression Analysis" http://cran.r-project.org/doc/contrib/Ricci-refcard-regression.pdf
It succinctly lists many useful regression related functions in R including diagnostic functions.
In particular, it lists the vif() function from the car package, which can assess multicollinearity.
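A minimal usage sketch, assuming the car package is installed (the simulated response y is mine, added only so a model can be fitted):

# car::vif() on a fitted lm object flags the collinear predictors
library(car)
y   <- x1 + x2 + rnorm(100)   # arbitrary simulated response
fit <- lm(y ~ x1 + x2 + x3)   # x3 is nearly a linear combination of x1 and x2
vif(fit)                      # very large VIFs for x1, x2 and x3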
http://en.wikipedia.org/wiki/Variance_inflation_factor
Consideration of multicollinearity often goes hand in hand with issues of assessing variable importance. If this applies to you, perhaps check out the relaimpo package: http://prof.beuth-hochschule.de/groemping/relaimpo/
See also Section 9.4 in this Book: Practical Regression and Anova using R [Faraway 2002].
Collinearity can be detected in several ways (a short R sketch of these checks follows the list):
Examination of the correlation matrix of the predictors will reveal large pairwise collinearities.
A regression of x_i on all other predictors gives R^2_i. Repeat for all predictors. R^2_i close to one indicates a problem — the offending linear combination may be found.
Examine the eigenvalues of t(X) %*% X, where X denotes the model matrix; small eigenvalues indicate a problem. The 2-norm condition number is the ratio of the largest to the smallest non-zero singular value of X, which in terms of the eigenvalues of t(X) %*% X is $\kappa = \sqrt{\lambda_1/\lambda_p}$ (see ?kappa); $\kappa \ge 30$ is considered large.
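A minimal sketch of these three checks, reusing the simulated data and the mm123 model matrix from above (the names X and ev are placeholders of mine):

# 1) pairwise correlations of the predictors: large off-diagonal values are suspect
round(cor(cbind(x1, x2, x3)), 3)

# 2) regress each predictor on the others: R^2_i close to 1 flags the offender
summary(lm(x3 ~ x1 + x2))$r.squared

# 3) eigenvalues of t(X) %*% X and the condition number kappa = sqrt(lambda_1 / lambda_p)
X  <- mm123                      # model matrix from above
ev <- eigen(crossprod(X))$values # eigenvalues of t(X) %*% X
sqrt(max(ev) / min(ev))          # matches kappa(X, exact = TRUE); >= 30 indicates trouble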