Connection between PCA and linear regression
Is there a formal link between linear regression and PCA? The goal of PCA is to decompose a matrix into a linear combination of variables that contain most of the information in the matrix. Suppose, for the sake of argument, that we're doing PCA on an input matrix rather than on its covariance matrix, and that the columns $X_1, X_2, \ldots, X_n$ of the matrix are the variables of interest. Then intuitively the PCA procedure seems similar to a linear regression in which a linear combination of the variables is used to predict the entries of the matrix. Is this correct thinking? How can it be made mathematically precise?
Imagine enumerating the (infinite) space of all linear combinations of the variables $X_1, X_2, \ldots, X_n$ of a data matrix and running a linear regression on each such combination to measure how much of the rows of the matrix the combination can 'explain'. Is there an interpretation of what PCA is doing in terms of this operation, i.e. how PCA would select the 'best' linear combinations in this procedure? I realize this procedure is obviously not computationally feasible; I only present it to try to make the link between PCA and linear regression concrete. This procedure works directly with linear combinations of the columns of a matrix, so it does not require them to be orthogonal.
Solution 1:
Mathematically, the difference between PCA and regression can be interpreted as just one additional multiplication with an inverse matrix...
Here is a correlation matrix with three groups of variables:
group g1: $x_{1,1},x_{1,2}$ correlated
group g2: $x_{2,1},x_{2,2},x_{2,3}$ correlated
group g3: $y$, composed from the variables of groups g1 and g2
The correlation matrix looks like
cor x1_1 x1_2 x2_1 x2_2 x2_3 ___y
-----------------------------------------------------------
x1_1 1.000 0.976 0.472 0.444 0.430 0.642
x1_2 0.976 1.000 0.594 0.567 0.538 0.767
x2_1 0.472 0.594 1.000 0.998 0.979 0.919
x2_2 0.444 0.567 0.998 1.000 0.986 0.917
x2_3 0.430 0.538 0.979 0.986 1.000 0.904
___y 0.642 0.767 0.919 0.917 0.904 1.000
-----------------------------------------------------------
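To make this setup concrete, here is a minimal NumPy sketch (not the matrix tool used below) that simulates data with the described structure: two internally correlated groups of x-variables, moderate correlation between the groups, and a y composed from both. The seed, noise levels and mixing weights are my own assumptions; the resulting correlation matrix only mimics the pattern of the printed one, it does not reproduce it exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

g1 = rng.normal(size=n)                       # common factor of group g1
g2 = 0.5 * g1 + 0.87 * rng.normal(size=n)     # factor of group g2, correlated with g1
x1_1 = g1 + 0.20 * rng.normal(size=n)
x1_2 = g1 + 0.20 * rng.normal(size=n)
x2_1 = g2 + 0.10 * rng.normal(size=n)
x2_2 = g2 + 0.10 * rng.normal(size=n)
x2_3 = g2 + 0.15 * rng.normal(size=n)
y    = 0.3 * g1 + 0.9 * g2 + 0.30 * rng.normal(size=n)   # composed from both groups

X = np.column_stack([x1_1, x1_2, x2_1, x2_2, x2_3, y])
R = np.corrcoef(X, rowvar=False)              # 6x6 correlation matrix, x-block first, y last
print(np.round(R, 3))
```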
This is the loadings matrix in its initial triangular Cholesky form, not yet rotated to the PC position:
tri f1 f2 f3 f4 f5 f6
-----------------------------------------------------------
x1_1 1.000 0.000 0.000 0.000 0.000 0.000
x1_2 0.976 0.218 0.000 0.000 0.000 0.000
x2_1 0.472 0.612 0.634 0.000 0.000 0.000
x2_2 0.444 0.615 0.649 0.054 0.000 0.000
x2_3 0.430 0.543 0.700 0.119 0.128 0.000
___y 0.642 0.644 0.350 0.156 0.112 0.117
-----------------------------------------------------------
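For reference, this triangular form is just the lower-triangular Cholesky factor of the correlation matrix; a minimal NumPy sketch, assuming a positive definite correlation matrix R such as the simulated one above:

```python
import numpy as np

def cholesky_loadings(R):
    """Lower-triangular loadings T with T @ T.T == R (the 'tri' form above).

    R is assumed to be a positive definite correlation matrix, e.g. the
    simulated R from the earlier sketch; np.linalg.cholesky raises an error
    for matrices that are not positive definite."""
    return np.linalg.cholesky(R)

# tri = cholesky_loadings(R); print(np.round(tri, 3))
```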
In PCA we do not distinguish between independent and dependent variables, so we might rotate to the PCA position, where the first axis/column then denotes the first principal component, and so on.
[22] pca = rot(dr,"pca")
pca f1 f2 f3 f4 f5 f6
-----------------------------------------------------------
x1_1 0.714 -0.692 0.100 0.041 -0.020 0.005
x1_2 0.810 -0.584 -0.031 -0.047 0.025 -0.007
x2_1 0.948 0.300 0.063 -0.081 0.003 0.018
x2_2 0.940 0.333 0.050 -0.049 -0.015 -0.019
x2_3 0.926 0.351 0.072 0.114 0.016 -0.001
___y 0.973 0.045 -0.226 0.027 -0.010 0.004
-----------------------------------------------------------
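For comparison, PC-position loadings like these can be obtained (up to column signs) directly from an eigendecomposition of the correlation matrix, without going through the Cholesky factor and a rotation; a NumPy sketch:

```python
import numpy as np

def pca_position_loadings(R):
    """Loadings in PC position: eigenvectors of R scaled by sqrt(eigenvalues).

    Any loadings matrix L with L @ L.T == R (such as the Cholesky factor above)
    differs from this one only by an orthogonal rotation of its columns; the
    column signs are arbitrary and may be flipped relative to the table."""
    w, V = np.linalg.eigh(R)
    order = np.argsort(w)[::-1]            # largest eigenvalue first
    return V[:, order] * np.sqrt(w[order])

# pca = pca_position_loadings(R); print(np.round(pca, 3))
```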
We see that all variables share one common factor, but we might also see that only two or three factors are "relevant". A quartimax rotation might align the three main factors better with the variable groups:
[26] qua = rot(pca,"quartimax",1..6,1..3)
qua.max f1 f2 f3 f4 f5 f6
-----------------------------------+------------------------
x1_1 0.373 0.925 -0.060 0.041 -0.020 0.005
x1_2 0.505 0.859 0.068 -0.047 0.025 -0.007
--------
x2_1 0.988 0.112 -0.059 -0.081 0.003 0.018
x2_2 0.994 0.078 -0.048 -0.049 -0.015 -0.019
x2_3 0.989 0.056 -0.071 0.114 0.016 -0.001
--------
___y 0.908 0.342 0.240 0.027 -0.010 0.004
-----------------------------------+------------------------
We see two main groups of x-variables with high in-group correlations, meaning they measure nearly the same thing, and a less sharply separated "group" containing only the y-variable, which is correlated with both groups but still has some individual variance (in factor f3).
This is classical PCA with quartimax/varimax rotation, the "little jiffy" procedure.
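For readers without the tool used above: quartimax belongs to the orthomax family of orthogonal rotations, which is easy to sketch in NumPy. The function below is the standard SVD-based orthomax iteration (gamma = 0 gives quartimax, gamma = 1 varimax); unlike the rot(..., "quartimax", 1..6, 1..3) call above it simply rotates whatever columns it is given, so the usage line restricts it to the first three PCs.

```python
import numpy as np

def orthomax(L, gamma=0.0, n_iter=100, tol=1e-8):
    """Orthogonal orthomax rotation of a loadings matrix L.

    gamma=0.0 is quartimax, gamma=1.0 is varimax (the usual SVD-based
    iteration, as in R's varimax).  Returns the rotated loadings."""
    p, k = L.shape
    R = np.eye(k)
    crit = 0.0
    for _ in range(n_iter):
        Lr = L @ R
        u, s, vt = np.linalg.svd(
            L.T @ (Lr ** 3 - (gamma / p) * Lr @ np.diag(np.sum(Lr ** 2, axis=0)))
        )
        R = u @ vt
        if s.sum() < crit * (1.0 + tol):
            break
        crit = s.sum()
    return L @ R

# qua = orthomax(pca_position_loadings(R)[:, :3], gamma=0.0)  # rotate the first 3 PCs only
```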
Now we move on to regression. In regression we declare one variable as dependent, in our case the variable y. We are interested in how y is composed from the independent variables; a still PCA-like point of view would be to find the PCA of the independent variables only and to leave factor f6, which carries the part of the y-variance that is uncorrelated with the x-variables, untouched as it comes from the initial triangular Cholesky factor.
[29] pca = rot(dr,"pca",1..5)
pca.reg f1 f2 f3 f4 f5 f6
---------------------------------------------------+--------
x1_1 0.729 -0.680 0.067 -0.048 -0.002 0.000
x1_2 0.816 -0.571 -0.067 0.055 0.001 0.000
-------------------
x2_1 0.946 0.315 -0.066 -0.033 0.019 0.000
x2_2 0.936 0.348 -0.041 -0.013 -0.023 0.000
x2_3 0.923 0.366 0.117 0.036 0.004 0.000
---------------------------------------------------+--------
___y 0.957 0.057 -0.076 0.245 -0.039 0.117
---------------------------------------------------+--------
The axes still show the "factors" and how each variable is composed of the common (f1 to f5) or individual (f6) factors.
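A NumPy sketch of this construction, assuming a 6x6 correlation matrix R ordered as in the tables (the five x-variables first, y last), e.g. the simulated R from the first sketch; it mimics the idea of the pca.reg table, not the exact numbers or column signs of the tool output:

```python
import numpy as np

def pca_reg_loadings(R):
    """Loadings with f1..f5 = PCs of the x-block only, f6 = unique part of y."""
    Rxx, rxy = R[:5, :5], R[:5, 5]
    w, V = np.linalg.eigh(Rxx)
    order = np.argsort(w)[::-1]
    Lx = V[:, order] * np.sqrt(w[order])      # 5x5 PC-position loadings of the x-block
    ay = np.linalg.solve(Lx, rxy)             # y's loadings on f1..f5, from Lx @ ay = r_xy
    u = np.sqrt(1.0 - ay @ ay)                # y's unique loading on f6 (sqrt of unexplained variance)
    return np.vstack([np.hstack([Lx, np.zeros((5, 1))]),
                      np.hstack([ay, [u]])])

# A = pca_reg_loadings(R); print(np.round(A, 3))   # compare with the pca.reg table
```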
Regression now asks for the composition of y not in terms of the factors (the coordinates on these axes) but in terms of the coordinates on the x-variables taken as axes.
Happily, we need only multiply the current loadings matrix by the inverse of the x-submatrix to get axes defined by the x-variables, and with them the "loadings" of y on x:
[32] B = (pca[*,1..5]*inv(pca[1..5,1..5]) ) || pca[*,6]
reg.B x1_1 x1_2 x2_1 x2_2 x2_3 ___y
---------------------------------------------------+--------
x1_1 1.000 0.000 0.000 0.000 0.000 0.000
x1_2 0.000 1.000 0.000 0.000 0.000 0.000
-------------------
x2_1 0.000 0.000 1.000 0.000 0.000 0.000
x2_2 0.000 0.000 0.000 1.000 0.000 0.000
x2_3 0.000 0.000 0.000 0.000 1.000 0.000
---------------------------------------------------+--------
___y -1.442 1.989 -1.394 0.955 0.876 0.117
We see that each of the axes is now identified with one of the x-variables, and we also see the "loadings" of y on those axes. Because the axes now point in the directions of the x-variables, the loadings of y in the last row are also the regression weights/coefficients, which are
B = [-1.442 1.989 -1.394 0.955 0.876 ]
(Because of the strong correlations within the groups, some regression weights exceed 1 in absolute value and the signs alternate. But this is not much of a concern for this methodological explanation.)
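The "one additional multiplication with an inverse matrix" can also be checked in NumPy. The sketch below uses the Cholesky factor instead of the PC-rotated loadings (any loadings matrix A with A A' = R gives the same result) and verifies that the loadings route agrees with the usual normal-equations solution. R is assumed to be e.g. the simulated matrix from the first sketch; because the printed correlations are rounded and the x-variables are nearly collinear, numbers computed from the rounded table need not match B above exactly.

```python
import numpy as np

def standardized_betas(R):
    """Regression weights of y on the five x-variables, two equivalent ways.

    R: 6x6 correlation matrix ordered as in the tables (x-block first, y last).
    Route 1: take any loadings matrix A with A @ A.T == R (here the Cholesky
    factor) and postmultiply y's loadings on f1..f5 by the inverse of the
    x-submatrix -- the 'one additional multiplication'.
    Route 2: the usual normal equations, solve(Rxx, r_xy)."""
    A = np.linalg.cholesky(R)
    beta_loadings = A[5, :5] @ np.linalg.inv(A[:5, :5])
    beta_ols = np.linalg.solve(R[:5, :5], R[:5, 5])
    assert np.allclose(beta_loadings, beta_ols)   # both routes agree
    return beta_loadings
```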
[Update]
Relating PCA and regression in this way leads very naturally to another instructive example which might improve intuition. This is the problem of multicollinearity, which, if it occurs in regression, is a problem for the researcher, but, if it occurs in PCA, only improves the validity of the estimation of separate components and of the loadings of the items on such (latent) constructs.
The device I want to introduce here is the "main direction" of the multicollinear items (which of course is the first principal component) of each of the two sets of independent items, $x1$ and $x2$. We can introduce latent variables which mark the PCs of the two sets of x-items. In practice this is done by applying a PC rotation whose maximization criterion is taken from one set only:
[32] pc1 = rot(tri,"pca",1..2,1..5)  // rotate to the pc's of items 1..2 only
[33] pc1 = {pc1,marksp(pc1,{1,2})}   // add marker variables for those pc's
PC1 pc1_1 pc1_2 c3 c4 c5 c6
--------------------------------------------------------------
x1_1 0.994 -0.109 0 0 0 0
x1_2 0.994 0.109 0 0 0 0
--------------------------------------------------------------
x2_1 0.562 0.571 0.599 0 0 0
x2_2 0.536 0.578 0.613 0.054 0 0
x2_3 0.521 0.516 0.657 0.123 0.125 0
--------------------------------------------------------------
y 0.722 0.575 0.316 0.151 0.11 0.116
===============================================================
pc1_1 1 0 0 0 0 0
pc1_2 0 1 0 0 0 0
===============================================================
If we were looking at the system of $x1_1, x1_2$ and $y$ only, we would already have the beta-values for the PCs of that $x1$-item set as $b_{pc1_1}=0.722$ and $b_{pc1_2}=0.575$ - no hassle because of (multi)collinearity!
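The mechanics can be reproduced in NumPy: because the PC scores of one item set are mutually orthogonal, regressing the standardized y on the standardized scores gives betas that are simply correlations, i.e. the loadings of y on the pc axes. A sketch, assuming a data matrix X ordered as in the first simulation sketch (values computed from simulated data will of course differ from the 0.722 / 0.575 in the table, and column signs are arbitrary):

```python
import numpy as np

def betas_on_subset_pcs(X, subset, y_col):
    """Betas of y on the standardized PC scores of one subset of columns.

    Since these scores are mutually orthogonal, the betas equal the
    correlations of y with the scores."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)             # standardize all variables
    w, V = np.linalg.eigh(np.corrcoef(Z[:, subset], rowvar=False))
    scores = Z[:, subset] @ V[:, np.argsort(w)[::-1]]    # PC scores of the subset
    scores /= scores.std(axis=0)                         # standardize the scores
    return scores.T @ Z[:, y_col] / len(X)               # correlations = betas

# betas_on_subset_pcs(X, [0, 1], 5)   # PCs of {x1_1, x1_2}; cf. b_pc1_1, b_pc1_2 above
```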
The same can be done with the second set of independent items $x2_1,x2_2,x2_3$:
[37] pc2 = rot(pc1,"pca",3..5,1..5)
[38] pc2 = {pc2,marksp(pc2,{1,2,3})}
PC2 pc2_1 pc2_2 pc2_3 c4 c5 c6
--------------------------------------------------------------
x1_1 0.477 -0.124 -0.464 0.729 0.101 0
x1_2 0.599 -0.189 -0.389 0.636 0.221 0
--------------------------------------------------------------
x2_1 0.997 -0.078 -0.022 0 0 0
x2_2 0.999 -0.041 0.027 0 0 0
x2_3 0.993 0.119 -0.005 0 0 0
--------------------------------------------------------------
y 0.923 -0.029 -0.058 0.212 0.294 0.116
===============================================================
pc1_1 0.542 -0.157 -0.429 0.687 0.162 0
pc1_2 0.557 -0.300 0.343 -0.424 0.55 0
--------------------------------------------------------------
pc2_1 1 0 0 0 0 0
pc2_2 0 1 0 0 0 0
pc2_3 0 0 1 0 0 0
===============================================================
The beta-value for the first PC of the second set of items (in a model without the first set) would be $b_{pc2_1}=0.923$, which is larger than the $b_{pc1_1}=0.722$ for the first PC of the first set of independent items.
To get the betas of the joint model we again need only invert the submatrix of the loadings of the whole set of 5 PC-markers and postmultiply the first 5 columns by that inverse. This gives us the "loadings" when the 5 PCs are taken as the axes of the coordinate system. We get
[42] beta = pca1[*,1..5]* inv(pca1[7..11,1..5]) || pca1[*,6]
beta pc1_1 pc1_2 pc2_1 pc2_2 pc2_3 c6
--------------------------------------------------------------
x1_1 0.994 -0.109 0 0 0 0
x1_2 0.994 0.109 0 0 0 0
--------------------------------------------------------------
x2_1 0 0 0.997 -0.078 -0.022 0
x2_2 0 0 0.999 -0.041 0.027 0
x2_3 0 0 0.993 0.119 -0.005 0
--------------------------------------------------------------
y 0.540 0.375 0.421 0.169 0.045 0.116
===============================================================
pc1_1 1 0 0 0 0 0
pc1_2 0 1 0 0 0 0
--------------------------------------------------------------
pc2_1 0 0 1 0 0 0
pc2_2 0 0 0 1 0 0
pc2_3 0 0 0 0 1 0
===============================================================
In short:
beta pc1_1 pc1_2 | pc2_1 pc2_2 pc2_3 | c6
---------------------------+-------------------------------+--------
y 0.540 0.375 | 0.421 0.169 0.045 | 0.116
===========================+===============================|========
In this joint model the "main direction" of the first set of independents has a beta-weight of 0.540 and the "main direction" of the second set one of 0.421. The value at $c6$ is given only for completeness: its square, $0.116^2$, is the unexplained variance of the dependent item $y$.
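The joint model can also be sketched in NumPy, again on a data matrix X ordered as in the first simulation sketch: the PC scores are orthogonal within each set but correlated across sets, so the joint betas come from the normal equations among all five scores (the numbers mimic the idea of the "In short" row, not its exact values):

```python
import numpy as np

def betas_on_joint_pcs(X):
    """Betas of y on the five subset-PCs jointly (2 from set 1, 3 from set 2)."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)

    def pc_scores(cols):
        w, V = np.linalg.eigh(np.corrcoef(Z[:, cols], rowvar=False))
        S = Z[:, cols] @ V[:, np.argsort(w)[::-1]]
        return S / S.std(axis=0)                       # standardized PC scores of one set

    S = np.column_stack([pc_scores([0, 1]), pc_scores([2, 3, 4])])
    Rss = np.corrcoef(S, rowvar=False)                 # correlations among the 5 scores
    rsy = S.T @ Z[:, 5] / len(X)                       # correlations of the scores with y
    return np.linalg.solve(Rss, rsy)                   # joint betas
```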
Solution 2:
No. PCA returns mixtures of the (typically mutually dependent) variables that contribute most to the observed variance. Linear regression does not mix input and output variables; it expresses the output variable as a linear combination of the input variables. (Linear combinations? Yes. Mixtures of input and output variables? No.)
PCA is computed by singular value decomposition. Linear regression is not. Consequences: linear regression misbehaves if two or more input variables are strongly correlated (i.e., the system is nearly rank deficient), typically producing coefficients that depend sensitively on noise. PCA merges the (almost) indistinguishable variables together and continues giving sane results. This merging of strongly correlated variables can reduce the apparent dimensionality of the system, generally improving explanatory power and prediction.
The process of using PCA for linear regression is called Principal Component Regression.
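A minimal scikit-learn sketch of principal component regression, assuming generic arrays X of shape (n_samples, n_features) and y of shape (n_samples,); the number of components k is an arbitrary illustrative choice, in practice it is usually selected by cross-validation:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def pcr(X, y, k=2):
    """Principal component regression: regress y on the first k PCs of X."""
    model = make_pipeline(StandardScaler(), PCA(n_components=k), LinearRegression())
    model.fit(X, y)
    return model

# Example with random data:
# rng = np.random.default_rng(0)
# X = rng.normal(size=(200, 5)); y = X @ rng.normal(size=5) + rng.normal(size=200)
# print(pcr(X, y).score(X, y))
```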