I'm learning Principal Component Analysis (PCA) and have learned that the eigenvectors of the covariance matrix of the data are the principal components, which maximize the variance of the projected data. I understand the intuition behind why we want the variance of the projected data to be as large as possible.

From this answer, I don't understand the following line:

The unit vector $u$ which maximizes variance $u^T\Sigma u$ is nothing but the eigenvector with the largest eigenvalue.

I know how the variance of the projected data points is $u^T\Sigma u$ from this answer. But I don't understand why this is maximized when $u$ is chosen to be the eigenvector of $\Sigma$ with the largest eigenvalue.

Intuitively, I see eigenvectors as the vectors whose direction stays fixed under a given linear transformation (they may only be scaled, and the scaling factors are the eigenvalues). Source: this answer and this video.

I can't see why vectors whose direction stays fixed under the given linear transformation should give the highest variance. Any intuitive explanation will help! Thanks.


Solution 1:

Let the spectral decomposition of $\Sigma$ be $\Sigma=PDP^T,$ where $P$ is orthogonal (its columns are orthonormal eigenvectors of $\Sigma$) and $D=\mathrm{diag}(\lambda_1,\dots,\lambda_d)$ is diagonal. Then $u^T\Sigma u=u^TPDP^Tu=\displaystyle\sum_{i=1}^d\lambda_i(p_i^Tu)^2,$ where $p_i$ is the $i^{\text{th}}$ column of $P$, in other words, the $i^\text{th}$ eigenvector of $\Sigma.$

We want to find a unit vector $u$ such that $\displaystyle\sum_{i=1}^d\lambda_i(p_i^Tu)^2$ is maximized. Since the $p_i$'s form an orthonormal basis and $\|u\|=1$, we have $\displaystyle\sum_{i=1}^d(p_i^Tu)^2=1.$ Consider the optimization problem: $$\text{Maximize }\displaystyle\sum_{i=1}^d\lambda_iz_i^2\text{ subject to }\sum_{i=1}^dz_i^2=1.$$ Suppose $\lambda_1\ge\lambda_2\ge\dots\ge\lambda_d.$ Since $\displaystyle\sum_{i=1}^d\lambda_iz_i^2\le\lambda_1\sum_{i=1}^dz_i^2=\lambda_1,$ the maximum value is $\lambda_1,$ and (when $\lambda_1>\lambda_2$) it is attained only if $z_1^2=1$ and $z_i=0$ for the rest, because otherwise some of the weight sits on a smaller eigenvalue and the objective falls below $\lambda_1.$ That would mean $$|p_1^Tu|=1,\text{ and }p_i^Tu=0\text{ for all }i\neq 1.$$ By the equality case of the Cauchy–Schwarz inequality, $|p_1^Tu|=1\iff u=c\times p_1$ for some constant $c.$ By the norm-$1$ constraint, $u=\pm p_1,$ and both signs define the same direction, namely the first principal component.
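As a quick numerical sanity check (my addition, not part of the original answer), here is a minimal NumPy sketch. The matrix `Sigma` below is just an arbitrary positive semi-definite example standing in for $\Sigma$: the quadratic form evaluated at the top eigenvector equals $\lambda_1$, and no random unit vector exceeds it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Build an arbitrary symmetric positive semi-definite matrix to play the role of Sigma.
A = rng.normal(size=(4, 4))
Sigma = A @ A.T

# Spectral decomposition: columns of P are orthonormal eigenvectors, lambdas the eigenvalues.
lambdas, P = np.linalg.eigh(Sigma)        # eigh returns eigenvalues in ascending order
p1, lambda1 = P[:, -1], lambdas[-1]       # eigenvector with the largest eigenvalue

# u^T Sigma u at the top eigenvector equals lambda_1 ...
print(p1 @ Sigma @ p1, lambda1)

# ... and no random unit vector does better.
u = rng.normal(size=(10_000, 4))
u /= np.linalg.norm(u, axis=1, keepdims=True)
print(np.max(np.einsum('ij,jk,ik->i', u, Sigma, u)) <= lambda1 + 1e-12)   # True
```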

Solution 2:

Take $A$ to be the covariance matrix, $x$ an eigenvector and $\lambda$ the corresponding eigenvalue, so that $Ax=\lambda x$. There is an intuitive way to think about it. What we are actually comparing is how much variance the data has along each candidate direction. Along the unit vector $x/\|x\|$, the action of $A$ is pure scaling: $A\,(x/\|x\|)=\lambda\,(x/\|x\|)$. The variance captured in that direction is therefore $(x/\|x\|)^T A\,(x/\|x\|)=\lambda$. Since the variance along an eigenvector is exactly its eigenvalue, the largest $\lambda$ indicates the eigenvector that retains the largest variance.
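To see this numerically, here is a small sketch (my addition, using NumPy and an arbitrary 2-D Gaussian sample): the sample variance of the data projected onto each unit eigenvector of the sample covariance matrix matches the corresponding eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(1)

# Sample some correlated 2-D data (an arbitrary example, not from the original answer).
X = rng.multivariate_normal(mean=[0, 0], cov=[[3.0, 1.2], [1.2, 1.0]], size=200_000)

Sigma = np.cov(X, rowvar=False)           # sample covariance matrix
lambdas, P = np.linalg.eigh(Sigma)        # eigenvalues ascending, eigenvectors in columns

# Variance of the data projected onto each unit eigenvector equals the eigenvalue.
for lam, u in zip(lambdas, P.T):
    print(lam, np.var(X @ u, ddof=1))     # the two numbers agree up to floating-point error
```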

Solution 3:

I think the previous answers already contain the meat of the matter but I just want to add the probability point of view so that it's a bit more obvious what the question is asking and why that's the right answer.

Suppose I have some data $x_1,x_2,\dots\sim\mathcal D$ drawn from some distribution, where $x_i\in \mathbb R^n$. I want to find a 1-dimensional representation of the $x_i$ that retains as much of the variance as possible. Well, every projection onto a 1-dimensional subspace can be written as $$ P_c(x) = c^Tx $$ for some $c$ with $c^Tc = 1$.

Your question is basically: what is the choice of $c$ such that the variance $$ \mathrm{Var}(P_c(x)) = \mathbb E\left[\left(c^Tx-c^T\bar x\right)^2\right] = c^T\underbrace{(\mathbb E[xx^T] - \bar x \bar x^T)}_{=\Sigma}c $$ is maximized, where $\bar x = \mathbb E[x]$?
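As a sketch of this identity (my addition, assuming arbitrary Gaussian sample data generated with NumPy), the empirical variance of the projection $c^Tx$ matches the quadratic form $c^T\Sigma c$ computed from $\mathbb E[xx^T]-\bar x\bar x^T$:

```python
import numpy as np

rng = np.random.default_rng(2)

# Samples with a nonzero mean, so the centering term actually matters (arbitrary example data).
X = rng.multivariate_normal(mean=[2.0, -1.0, 0.5],
                            cov=[[2.0, 0.4, 0.0],
                                 [0.4, 1.0, 0.3],
                                 [0.0, 0.3, 0.5]],
                            size=500_000)
c = rng.normal(size=3)
c /= np.linalg.norm(c)                    # unit vector, c^T c = 1

xbar = X.mean(axis=0)                     # empirical E[x]
Sigma = (X.T @ X) / len(X) - np.outer(xbar, xbar)   # empirical E[xx^T] - xbar xbar^T

lhs = np.mean((X @ c - c @ xbar) ** 2)    # Var(c^T x), computed directly
rhs = c @ Sigma @ c                       # the quadratic form c^T Sigma c
print(lhs, rhs)                           # agree up to floating-point error
```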

So, you just need to find the normalized vector $c$ such that $c^T\Sigma c$ is as large as possible. This $c$ is the eigenvector corresponding to the largest eigenvalue of $\Sigma$. To see this, take the eigenvalue decomposition $\Sigma = PD P^T$ where $P = (p_1,\dots,p_n)$ and $D = \mathrm{diag}(d_1,\dots,d_n)$. Without loss of generality, assume $d_1 \geq d_2 \geq \cdots \geq d_n$. Then, with the change of variables $v = P^Tc$, we can see that

$$ c^T\Sigma c = \sum_{i=1}^n v_i^2 d_i \leq d_1 \underbrace{v^Tv}_{=1} = d_1, $$ since $v^Tv = c^TP^TPc=c^Tc = 1$. Equality is attained (assuming $d_1 > d_2$) only if $v = (\pm 1,0,\dots,0)$, which corresponds to $c = \pm p_1$.
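A short numerical check of the change of variables (my addition, with an arbitrary positive semi-definite matrix standing in for $\Sigma$): $c^T\Sigma c$ equals $\sum_i v_i^2 d_i$ for $v = P^Tc$, is bounded by $d_1$, and the bound is attained at $c = p_1$.

```python
import numpy as np

rng = np.random.default_rng(3)

# Arbitrary covariance-like matrix standing in for Sigma.
A = rng.normal(size=(5, 5))
Sigma = A @ A.T

d, P = np.linalg.eigh(Sigma)              # eigenvalues ascending; columns of P orthonormal
d, P = d[::-1], P[:, ::-1]                # reorder so that d_1 >= d_2 >= ... >= d_n

c = rng.normal(size=5)
c /= np.linalg.norm(c)                    # any unit vector

v = P.T @ c                               # change of variables v = P^T c, still a unit vector
print(c @ Sigma @ c, np.sum(v**2 * d))    # the two expressions agree
print(c @ Sigma @ c <= d[0])              # bounded by the largest eigenvalue d_1
print(P[:, 0] @ Sigma @ P[:, 0], d[0])    # equality is attained at c = p_1
```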