Screening (multi)collinearity in a regression model
The kappa() function can help. Here is a simulated example:
> set.seed(42)
> x1 <- rnorm(100)
> x2 <- rnorm(100)
> x3 <- x1 + 2*x2 + rnorm(100)*0.0001 # so x3 approx a linear comb. of x1+x2
> mm12 <- model.matrix(~ x1 + x2) # normal model, two indep. regressors
> mm123 <- model.matrix(~ x1 + x2 + x3) # bad model with near collinearity
> kappa(mm12) # a 'low' kappa is good
[1] 1.166029
> kappa(mm123) # a 'high' kappa indicates trouble
[1] 121530.7
We can go further and make the third regressor more and more collinear:
> x4 <- x1 + 2*x2 + rnorm(100)*0.000001 # even more collinear
> mm124 <- model.matrix(~ x1 + x2 + x4)
> kappa(mm124)
[1] 13955982
> x5 <- x1 + 2*x2 # now x5 is linear comb of x1,x2
> mm125 <- model.matrix(~ x1 + x2 + x5)
> kappa(mm125)
[1] 1.067568e+16
Note that kappa() computes an approximation of the condition number by default; see help(kappa) for details and for the exact = TRUE option.
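If the approximation is a concern, the exact 2-norm condition number can be requested instead; a minimal sketch, reusing the mm123 matrix from above:

# kappa() approximates the condition number by default;
# exact = TRUE computes it from the singular values (slower for large matrices)
kappa(mm123)               # fast approximation
kappa(mm123, exact = TRUE) # exact 2-norm condition number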
Just to add to what Dirk said about the Condition Number method: a common rule of thumb is that a condition number greater than 30 indicates severe collinearity. Other methods, apart from the condition number, include:
1) The determinant of the correlation matrix, which ranges from 0 (perfect collinearity) to 1 (no collinearity). (The example below uses the covariance matrix instead; its determinant is not confined to [0, 1], but a value near zero likewise signals near-collinearity.)
# using Dirk's example
> det(cov(mm12[,-1]))
[1] 0.8856818
> det(cov(mm123[,-1]))
[1] 8.916092e-09
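For a determinant that is actually bounded between 0 and 1, the correlation matrix can be used; a minimal sketch, reusing Dirk's mm12 and mm123 from above:

# determinant of the correlation matrix lies in [0, 1];
# values near 0 indicate (near-)collinearity among the predictors
det(cor(mm12[, -1]))   # close to 1: x1 and x2 are independent
det(cor(mm123[, -1]))  # close to 0: x3 is nearly x1 + 2*x2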
2) Using the fact that the determinant of a matrix equals the product of its eigenvalues, the presence of one or more very small eigenvalues indicates collinearity:
> eigen(cov(mm12[,-1]))$values
[1] 1.0876357 0.8143184
> eigen(cov(mm123[,-1]))$values
[1] 5.388022e+00 9.862794e-01 1.677819e-09
3) The Variance Inflation Factor (VIF). The VIF for predictor i is 1/(1 - R_i^2), where R_i^2 is the R^2 from a regression of predictor i on the remaining predictors. Collinearity is present when the VIF for at least one independent variable is large; a common rule of thumb is that VIF > 10 is of concern. For an implementation in R see, for example, the vif() function from the car package mentioned below. I would also like to comment that the use of R^2 for detecting collinearity should go hand in hand with visual examination of the scatterplots, because a single outlier can "cause" collinearity where it doesn't exist, or can hide collinearity where it does exist.
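As an illustration, the VIFs can also be computed by hand from the definition above; a minimal sketch using the simulated x1, x2, x3 (the preds data frame and the loop are my own scaffolding, not part of the original answer):

# VIF_i = 1 / (1 - R_i^2), regressing each predictor on the others;
# values above ~10 are usually taken as a warning sign
preds <- data.frame(x1, x2, x3)
vifs <- sapply(names(preds), function(v) {
  r2 <- summary(lm(reformulate(setdiff(names(preds), v), response = v),
                   data = preds))$r.squared
  1 / (1 - r2)
})
vifs  # huge VIFs expected, since x3 is nearly x1 + 2*x2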
You might like Vito Ricci's Reference Card "R Functions For Regression Analysis" http://cran.r-project.org/doc/contrib/Ricci-refcard-regression.pdf
It succinctly lists many useful regression related functions in R including diagnostic functions.
In particular, it lists the vif() function from the car package, which can assess multicollinearity.
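A minimal usage sketch, assuming the car package is installed (the simulated response y is mine, added only so a model can be fitted):

# car::vif() on a fitted lm object flags the collinear predictors
library(car)
y   <- x1 + x2 + rnorm(100)   # arbitrary simulated response
fit <- lm(y ~ x1 + x2 + x3)   # x3 is nearly a linear combination of x1 and x2
vif(fit)                      # very large VIFs for x1, x2 and x3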
http://en.wikipedia.org/wiki/Variance_inflation_factor
Consideration of multicollinearity often goes hand in hand with issues of assessing variable importance. If this applies to you, perhaps check out the relaimpo package: http://prof.beuth-hochschule.de/groemping/relaimpo/
See also Section 9.4 in this Book: Practical Regression and Anova using R [Faraway 2002].
Collinearity can be detected in several ways (a short R sketch of these checks follows the list):
Examination of the correlation matrix of the predictors will reveal large pairwise collinearities.
A regression of x_i on all other predictors gives R^2_i. Repeat for all predictors. R^2_i close to one indicates a problem — the offending linear combination may be found.
Examine the eigenvalues of t(X) %*% X, where X denotes the model matrix; small eigenvalues indicate a problem. The 2-norm condition number is the ratio of the largest to the smallest non-zero singular value of X, which in terms of the eigenvalues of t(X) %*% X is $\kappa = \sqrt{\lambda_1/\lambda_p}$ (see ?kappa); $\kappa \ge 30$ is considered large.
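A minimal sketch of these three checks, reusing the simulated data and the mm123 model matrix from above (the names X and ev are placeholders of mine):

# 1) pairwise correlations of the predictors: large off-diagonal values are suspect
round(cor(cbind(x1, x2, x3)), 3)

# 2) regress each predictor on the others: R^2_i close to 1 flags the offender
summary(lm(x3 ~ x1 + x2))$r.squared

# 3) eigenvalues of t(X) %*% X and the condition number kappa = sqrt(lambda_1 / lambda_p)
X  <- mm123                      # model matrix from above
ev <- eigen(crossprod(X))$values # eigenvalues of t(X) %*% X
sqrt(max(ev) / min(ev))          # matches kappa(X, exact = TRUE); >= 30 indicates trouble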