Why do eigenvectors with the largest eigenvalues maximize the variance in PCA?
I'm learning Principal Component Analysis (PCA) and have learned that the eigenvectors of the covariance matrix of the data are the principal components, which maximize the variance of the projected data. I understand the intuition behind why we want the variance of the projected data to be as large as possible.
From this answer, I don't understand the following line:
The unit vector $u$ which maximizes the variance $u^T\Sigma u$ is nothing but the eigenvector with the largest eigenvalue.
I know from this answer how the variance of the projected data points is $u^T\Sigma u$. But I don't understand why this is maximized when $u$ is chosen to be the eigenvector of $\Sigma$ with the largest eigenvalue.
Intuitively, I see eigenvectors as the vectors whose direction stays fixed under the given linear transformation (they may only be scaled, by factors known as the eigenvalues). Source: this answer and this video.
I can't see why vectors whose direction is fixed under the given linear transformation give the highest variance. Any intuitive explanation will help! Thanks.
Solution 1:
Let the spectral decomposition of $\Sigma$ be $\Sigma=PDP^T,$ where $P$ is orthogonal (its columns are orthonormal) and $D$ is diagonal with the eigenvalues $\lambda_1,\dots,\lambda_d$ on its diagonal. Then $u^T\Sigma u=u^TPDP^Tu=\displaystyle\sum_{i=1}^d\lambda_i(p_i^Tu)^2,$ where $p_i$ is the $i^{\text{th}}$ column of $P$, in other words, the $i^\text{th}$ eigenvector of $\Sigma.$
We want to find the unit vector $u$ for which $\displaystyle\sum_{i=1}^d\lambda_i(p_i^Tu)^2$ is maximized. Since the $p_i$'s form an orthonormal basis and $\|u\|=1$, we have $\displaystyle\sum_{i=1}^d(p_i^Tu)^2=\|P^Tu\|^2=\|u\|^2=1.$ Consider the optimization problem: $$\text{Maximize }\displaystyle\sum_{i=1}^d\lambda_iz_i^2\text{ subject to }\sum_{i=1}^dz_i^2=1.$$ Suppose $\lambda_1\ge\lambda_2\ge\dots\ge\lambda_d.$ Then $\sum_{i=1}^d\lambda_iz_i^2\le\lambda_1\sum_{i=1}^dz_i^2=\lambda_1,$ and this bound is attained by putting all the weight on the first coordinate: $z_1^2=1,$ $z_i=0$ for the rest (moving weight to any other coordinate cannot increase the objective). That would mean $$|p_1^Tu|=1,\text{ and }p_i^Tu=0\text{ for all }i\neq 1.$$ By the equality case of the Cauchy–Schwarz inequality, $|p_1^Tu|=1\iff u=c\times p_1$ for some constant $c.$ By the norm-$1$ constraint, $u=\pm p_1$; either sign gives the same direction, so the maximizer is the top eigenvector $p_1.$
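To make Solution 1 concrete, here is a minimal numerical sketch (assuming Python with NumPy; the variable names are my own choices): it builds a random covariance matrix, takes its eigendecomposition, and checks that no random unit vector $u$ attains a larger value of $u^T\Sigma u$ than the top eigenvector $p_1$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a random symmetric positive semi-definite "covariance" matrix Sigma.
A = rng.normal(size=(5, 5))
Sigma = A @ A.T

# Eigendecomposition: eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(Sigma)
p1 = eigvecs[:, -1]          # eigenvector with the largest eigenvalue
lambda1 = eigvals[-1]

# Variance captured by the top eigenvector equals its eigenvalue.
print(p1 @ Sigma @ p1, lambda1)   # the two numbers agree

# No random unit vector does better than p1.
best = 0.0
for _ in range(10_000):
    u = rng.normal(size=5)
    u /= np.linalg.norm(u)
    best = max(best, u @ Sigma @ u)
print(best <= lambda1 + 1e-9)     # True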
Solution 2:
Take $A$ to be the covariance matrix, $x$ an eigenvector, and $\lambda$ the corresponding eigenvalue, so that $Ax=\lambda x$. There is an intuitive way to think about it. What we are actually comparing is how strongly $A$ stretches the eigenvector $x$: the larger the stretch, the more variance that direction represents. The image of the unit vector $x/\|x\|$ under $A$ is $Ax/\|x\|=\lambda x/\|x\|$, whose length is $\lambda$. Since $x/\|x\|$ is just the unit vector along the eigenvector, we only need to compare the $\lambda$'s. Therefore, the eigenvector with the largest $\lambda$ is the direction that retains the largest variance.
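A brief numerical illustration of that intuition (again a NumPy sketch with illustrative names): for each unit eigenvector $x$, the length of $\Sigma x$ is exactly the eigenvalue $\lambda$, so comparing eigenvalues compares how strongly the covariance matrix stretches each eigen-direction.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
Sigma = A @ A.T                     # symmetric PSD matrix, playing the role of A above

eigvals, eigvecs = np.linalg.eigh(Sigma)
for lam, x in zip(eigvals, eigvecs.T):
    # x is already a unit vector; Sigma @ x = lam * x, so its length is lam.
    print(np.linalg.norm(Sigma @ x), lam)   # the two columns match
```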
Solution 3:
I think the previous answers already contain the meat of the matter but I just want to add the probability point of view so that it's a bit more obvious what the question is asking and why that's the right answer.
Suppose I have data drawn from some distribution, $x_1,x_2,\dots\sim\mathcal D$, where $x_i\in\mathbb R^n$. I want to find a 1-dimensional representation of the $x_i$ along which the spread (variance) of the data is maximized. All projections onto 1-dimensional subspaces can be written as $$ P_c(x) = c^Tx $$ for some $c$ with $c^Tc = 1$.
Your question is basically, what is the choice of $c$ such that the variance $$ \mathrm{Var}(P_c(x)) = \mathbb E\left[\left(c^Tx-c^T\bar x\right)^2\right] = c^T\underbrace{(\mathbb E[xx^T] - \bar x \bar x^T)}_{=\Sigma}c $$ is maximized, where $\bar x = \mathbb E[x]$.
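As a sanity check on this identity (a sketch assuming NumPy; the dataset and names are made up for illustration), we can estimate $\mathrm{Var}(c^Tx)$ directly from samples and compare it with $c^T\Sigma c$ computed from the sample covariance:

```python
import numpy as np

rng = np.random.default_rng(2)

# Draw samples x_i in R^3 from some distribution D (here a correlated Gaussian).
n, dim = 100_000, 3
L = rng.normal(size=(dim, dim))
X = rng.normal(size=(n, dim)) @ L.T

c = rng.normal(size=dim)
c /= np.linalg.norm(c)              # unit vector, c^T c = 1

Sigma = np.cov(X, rowvar=False)     # sample covariance matrix

var_direct = np.var(X @ c)          # Var(c^T x) estimated from the projections
var_quadratic = c @ Sigma @ c       # c^T Sigma c
print(var_direct, var_quadratic)    # nearly identical
```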
So, you just need to find the normalized vector $c$ such that $c^T\Sigma c$ is as large as possible. This is $c$ = the eigenvector corresponding to the largest eigenvalue of $\Sigma$. To see this, take the eigenvalue decomposition $\Sigma = PD P^T$ where $P = (p_1,...,p_n)$ and $D = \mathrm{diag}(d_1,...,d_n)$. Without loss of generality, we assume $d_1 \geq d_2 \geq \cdots \geq d_n$. Then taking a change of variables $v = P^Tc$, we can see that
$$ c^T\Sigma c = \sum_{i=1}^n v_i^2 d_i \leq d_1 \underbrace{v^Tv}_{=1} = d_1, $$ since $v^Tv = c^TP^TPc=c^Tc = 1$. Equality is achieved when $v = (\pm 1,0,\dots,0)$, which corresponds to $c = \pm p_1$, the eigenvector with the largest eigenvalue.
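The change of variables $v = P^Tc$ and the bound can also be checked numerically (same NumPy sketch conventions as above): for any unit vector $c$, $\sum_i v_i^2 d_i \le d_1$, with equality at $c = p_1$.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(6, 6))
Sigma = A @ A.T

d, P = np.linalg.eigh(Sigma)             # eigh: ascending order, so d[-1] is d_1 in the text
d1, p1 = d[-1], P[:, -1]

c = rng.normal(size=6)
c /= np.linalg.norm(c)

v = P.T @ c                              # change of variables v = P^T c
print(np.sum(v**2 * d), c @ Sigma @ c)   # equal: both are c^T Sigma c
print(np.sum(v**2 * d) <= d1 + 1e-9)     # True: bounded by the largest eigenvalue
print(np.isclose(p1 @ Sigma @ p1, d1))   # True: equality at c = p1
```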