What does the R function `poly` really do?
I have read through the manual page ?poly (which I admit I did not completely comprehend) and also the description of the function in the book Introduction to Statistical Learning.
My current understanding is that a call to poly(horsepower, 2) should be equivalent to writing horsepower + I(horsepower^2). However, this seems to be contradicted by the output of the following code:
library(ISLR)
summary(lm(mpg~poly(horsepower,2), data=Auto))$coef
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 23.44592 0.2209163 106.13030 2.752212e-289
#poly(horsepower, 2)1 -120.13774 4.3739206 -27.46683 4.169400e-93
#poly(horsepower, 2)2 44.08953 4.3739206 10.08009 2.196340e-21
summary(lm(mpg~horsepower+I(horsepower^2), data=Auto))$coef
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 56.900099702 1.8004268063 31.60367 1.740911e-109
#horsepower -0.466189630 0.0311246171 -14.97816 2.289429e-40
#I(horsepower^2) 0.001230536 0.0001220759 10.08009 2.196340e-21
My question is, why don't the outputs match, and what is poly
really doing?
Solution 1:
Raw polynomials
To get ordinary polynomials, as in the question, use raw = TRUE. Unfortunately, raw polynomials have an undesirable aspect in regression: if we fit, say, a quadratic and then a cubic, the lower-order coefficients of the cubic are all different from those of the quadratic, i.e. 56.900099702, -0.466189630, 0.001230536 for the quadratic vs. 6.068478e+01, -5.688501e-01, 2.079011e-03 after refitting with a cubic below.
library(ISLR)
fm2raw <- lm(mpg ~ poly(horsepower, 2, raw = TRUE), Auto)
cbind(coef(fm2raw))
## [,1]
## (Intercept) 56.900099702
## poly(horsepower, 2, raw = TRUE)1 -0.466189630
## poly(horsepower, 2, raw = TRUE)2 0.001230536
fm3raw <- lm(mpg ~ poly(horsepower, 3, raw = TRUE), Auto)
cbind(coef(fm3raw))
## [,1]
## (Intercept) 6.068478e+01
## poly(horsepower, 3, raw = TRUE)1 -5.688501e-01
## poly(horsepower, 3, raw = TRUE)2 2.079011e-03
## poly(horsepower, 3, raw = TRUE)3 -2.146626e-06
Orthogonal polynomials
What we would really like is to add the cubic term in such a way that the lower-order coefficients fit using the quadratic stay the same after refitting with a cubic. To do this, take linear combinations of the columns of poly(horsepower, 2, raw = TRUE), and do the same with poly(horsepower, 3, raw = TRUE), such that the columns in the quadratic fit are orthogonal to each other, and similarly for the cubic fit; a short sketch recovering those weights appears after the cubic fit below. That is sufficient to guarantee that the lower-order coefficients won't change when we add higher-order terms. Note how the first three coefficients are now the same in the two sets below (whereas above they differ). That is, in both cases below the three lower-order coefficients are 23.44592, -120.13774 and 44.08953.
fm2 <- lm(mpg ~ poly(horsepower, 2), Auto)
cbind(coef(fm2))
## [,1]
## (Intercept) 23.44592
## poly(horsepower, 2)1 -120.13774
## poly(horsepower, 2)2 44.08953
fm3 <- lm(mpg ~ poly(horsepower, 3), Auto)
cbind(coef(fm3))
## [,1]
## (Intercept) 23.445918
## poly(horsepower, 3)1 -120.137744
## poly(horsepower, 3)2 44.089528
## poly(horsepower, 3)3 -3.948849
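The linear combinations themselves can be recovered explicitly. Here is a small illustrative sketch (the names raw2 and orth2 are just placeholders) that solves for the weights mapping the intercept plus the raw columns onto the orthogonal columns:
raw2  <- cbind(1, poly(Auto$horsepower, 2, raw = TRUE))  # intercept plus raw columns
orth2 <- poly(Auto$horsepower, 2)                        # orthogonal columns
qr.solve(raw2, orth2)                                    # 3 x 2 matrix of weights
Multiplying raw2 by that weight matrix should reproduce the orthogonal columns up to rounding, confirming that each orthogonal column lies in the span of the intercept and the raw columns.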
Same predictions
Importantly, since the columns of poly(horsepower, 2) are just linear combinations of the columns of poly(horsepower, 2, raw = TRUE), the two quadratic models (orthogonal and raw) represent the same model (i.e. they give the same predictions) and differ only in parameterization. For example, the fitted values are the same:
all.equal(fitted(fm2), fitted(fm2raw))
## [1] TRUE
This would also be true of the raw and orthogonal cubic models.
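A quick check of that claim, reusing the fm3 and fm3raw fits from above:
all.equal(fitted(fm3), fitted(fm3raw))
# should also return TRUE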
Orthogonality
We can verify that the polynomial columns really are orthogonal to each other and also orthogonal to the intercept:
nr <- nrow(Auto)
e <- rep(1, nr) / sqrt(nr) # constant vector of unit length
p <- cbind(e, poly(Auto$horsepower, 2))
zapsmall(crossprod(p))
## e 1 2
## e 1 0 0
## 1 0 1 0
## 2 0 0 1
Residual sum of squares
Another nice property of orthogonal polynomials: because poly
produces a matrix whose columns have unit length and are mutually orthogonal (and also orthogonal to the intercept column), the reduction in residual sum of squares from adding the cubic term is simply the squared length of the projection of the response vector onto the cubic column of the model matrix.
# these three give the same result
# 1. squared length of projection of y, i.e. Auto$mpg, on cubic term column
crossprod(model.matrix(fm3)[, 4], Auto$mpg)^2
## [,1]
## [1,] 15.5934
# 2. difference in sums of squares
deviance(fm2) - deviance(fm3)
## [1] 15.5934
# 3. difference in sums of squares from anova
anova(fm2, fm3)
##
## Analysis of Variance Table
##
## Model 1: mpg ~ poly(horsepower, 2)
## Model 2: mpg ~ poly(horsepower, 3)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 389 7442.0
## 2 388 7426.4 1 15.593 0.8147 0.3673 <-- note Sum of Sq value
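Equivalently, because the cubic column has unit length and is orthogonal to the other columns (including the intercept), its regression coefficient is exactly that projection, so squaring the coefficient reported above should give the same reduction:
coef(fm3)[4]^2
# (-3.948849)^2, i.e. about 15.59, matching the Sum of Sq above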
Solution 2:
When introducing polynomial terms in a statistical model, the usual motivation is to determine whether the response is "curved" and whether the curvature is "significant" when that term is added. The upshot of throwing in an I(x^2)
term is that minor deviations may get "magnified" by the fitting process, depending on their location, and misinterpreted as due to the curvature term when they were just fluctuations at one end or the other of the range of the data. This results in inappropriate declarations of "significance".
If you just throw in a squared term with I(x^2)
, it is of necessity going to be highly correlated with x, at least in the domain where x > 0
. Using poly(x, 2)
instead creates a "curved" set of variables in which the linear term is not so highly correlated with x, and in which the curvature is roughly the same across the range of the data. (If you want to read up on the statistical theory, search on "orthogonal polynomials".) Just type poly(1:10, 2)
and look at the two columns:
poly(1:10, 2)
1 2
[1,] -0.49543369 0.52223297
[2,] -0.38533732 0.17407766
[3,] -0.27524094 -0.08703883
[4,] -0.16514456 -0.26111648
[5,] -0.05504819 -0.34815531
[6,] 0.05504819 -0.34815531
[7,] 0.16514456 -0.26111648
[8,] 0.27524094 -0.08703883
[9,] 0.38533732 0.17407766
[10,] 0.49543369 0.52223297
attr(,"degree")
[1] 1 2
attr(,"coefs")
attr(,"coefs")$alpha
[1] 5.5 5.5
attr(,"coefs")$norm2
[1] 1.0 10.0 82.5 528.0
attr(,"class")
[1] "poly" "matrix"
The "quadratic" term is centered on 5.5 and the linear term has been shifted down so it is 0 at the same x-point (with the implicit (Intercept)
term in the model being depended upon for shifting everything back at the time predictions are requested.)
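To make the correlation point concrete, here is a small illustrative check using the same 1:10 values (x and p are just throwaway names for the example):
x <- 1:10
cor(x, x^2)           # raw linear and squared terms: highly correlated for x > 0
p <- poly(x, 2)
cor(p[, 1], p[, 2])   # orthogonal columns: essentially zero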
Solution 3:
A quick answer is that poly
of a vector x
is essentially equivalent to the QR decomposition of the matrix whose columns are the powers of x
(after centering). For example:
> x <- rnorm(50)
> x0 <- sapply(1:5, function(z) x^z)           # raw powers x, x^2, ..., x^5
> x0 <- apply(x0, 2, function(z) z - mean(z))  # center each column
> x0 <- qr.Q(qr(x0))                           # orthonormal basis via QR
> cor(x0, poly(x, 5))
1 2 3 4 5
[1,] -1.000000e+00 -1.113975e-16 -3.666033e-17 7.605615e-17 -1.395624e-17
[2,] -3.812474e-17 1.000000e+00 1.173755e-16 -1.262333e-17 -3.988085e-17
[3,] -7.543077e-17 -7.778452e-17 1.000000e+00 3.104693e-16 -8.472204e-17
[4,] 1.722929e-17 -1.952572e-16 1.013803e-16 -1.000000e+00 -1.611815e-16
[5,] -5.973583e-17 -1.623762e-18 9.163891e-17 -3.037121e-16 1.000000e+00
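The ±1 entries simply reflect the sign indeterminacy of a QR decomposition: each column is determined only up to a sign flip. A short sketch (reusing x and x0 from above) that aligns the column signs should show the two bases agree numerically:
s <- sign(diag(cor(x0, poly(x, 5))))   # per-column sign flips
all.equal(x0 %*% diag(s), unclass(poly(x, 5)), check.attributes = FALSE)
# should be TRUE up to floating-point error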