Let $A$ be a symmetric, real matrix. The goal is to find a unit vector $v$ such that the value $v^{T}Av$ is

  1. maximized, and
  2. minimized.

The answer is that $v$ should be the eigenvector of $A$ with

  1. largest eigenvalue, and
  2. smallest eigenvalue.

Could anyone give an explanation why? What do eigenvectors have to do with maximizing such a number?

Please make it as 'step by step' as possible, and express it using very basic algebra (if possible). I don't quite understand MooS' answer.


There is an orthogonal matrix $T$, such that $T^tAT$ is diagonal.

Because of $v^t(T^tAT)v=(Tv)^tA(Tv)$ and $\|Tv\|=\|v\|$, one can assume $A$ to be diagonal, and in this case the assertion is immediate, since $v^tAv = \sum_{i} v_i^2\lambda_i$ with the $\lambda_i$ being the diagonal elements (the eigenvalues).

Let me provide some more details:

First of all we show $\|Tv\|=\|v\|$: We have $T^tT=1$ (by definition), hence $$\|Tv\|^2=(Tv)^t(Tv)=v^t(T^tT)v=v^tv=\|v\|^2$$
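If you want to see this fact numerically, here is a minimal sketch (assuming NumPy; the orthogonal matrix $T$ below is just a sample obtained from a QR factorization):

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a sample orthogonal matrix T via QR factorization of a random matrix.
T, _ = np.linalg.qr(rng.standard_normal((4, 4)))

v = rng.standard_normal(4)

# ||Tv|| equals ||v|| because T^t T = I.
print(np.linalg.norm(T @ v), np.linalg.norm(v))     # same value
print(np.allclose(T.T @ T, np.eye(4)))              # True
```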

For the rest: We have to consider the numbers $v^tAv$, where $v$ runs through all vectors of length $1$. Since $T$ is orthogonal, it is a bijection on the unit sphere by our above argument, hence $Tv$ runs through all vectors of length $1$, if $v$ does so.

So considering the numbers $v^tAv$ is the same as considering the numbers $(Tv)^tA(Tv)$. Now the computation $$(Tv)^tA(Tv) = v^t(T^tAT)v$$ shows that we have to consider the numbers $v^t(T^tAT)v$, where $v$ runs through all vectors of length $1$. Since $T^tAT$ is diagonal, we are in the starting situation, but the matrix is diagonal now. So we could have assumed from the start that $A$ is diagonal, hence $A = \mathrm{diag}(\lambda_1, \dotsc, \lambda_n)$.

But in this case, the result is easy, since we have $v^tAv = \sum_{i} v_i^2\lambda_i$. Maximizing or minimizing this expression subject to $\sum_i v_i^2=1$ is an easy task: the sum is a weighted average of the $\lambda_i$ (the weights $v_i^2$ are nonnegative and sum to $1$), so it lies between the smallest and the largest $\lambda_i$. The minimum is the minimal $\lambda_i$ and the maximum is the maximal $\lambda_i$, each attained by putting all the weight on the corresponding coordinate, i.e. by the corresponding standard basis vector (an eigenvector).
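As a quick numerical sanity check of the whole claim, here is a small sketch (assuming NumPy; the symmetric matrix is random): sample many unit vectors and compare $v^tAv$ with the extreme eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(1)

# A random real symmetric matrix.
B = rng.standard_normal((5, 5))
A = (B + B.T) / 2

# Evaluate v^t A v for many random unit vectors.
V = rng.standard_normal((100000, 5))
V /= np.linalg.norm(V, axis=1, keepdims=True)
quad = np.einsum('ij,jk,ik->i', V, A, V)

lams, Q = np.linalg.eigh(A)          # eigenvalues in ascending order
v_min, v_max = Q[:, 0], Q[:, -1]

print(v_min @ A @ v_min, lams[0])    # equal: minimum attained at that eigenvector
print(v_max @ A @ v_max, lams[-1])   # equal: maximum attained at that eigenvector
print(quad.min() >= lams[0] - 1e-12, quad.max() <= lams[-1] + 1e-12)  # all other unit vectors stay in between
```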

You should really get used to such 'diagonalization arguments': they are the main reason why diagonalizing matrices is such an important tool.


I'll make here a very informal attempt at explaining what the eigenvectors of the smallest and greatest eigenvalues have to do with maximizing and minimizing $v^{T}Av$. It will be neither rigorous nor cover all cases, but I certainly hope it will enlighten you.


Decompositions

Matrices can often be re-expressed as a product of two, three or more other matrices, usually chosen to have "nice" properties. This re-expression is called a decomposition, and the act of re-expressing is called decomposing. There are many decompositions, and they are classified by the kind of "nice" properties that the resultant product matrices have. Some of these decompositions always exist, while some are applicable only to a few select types of matrices, and some will produce results with "nicer" properties if their input is nicer too.

Eigendecomposition

Here we'll be interested in one particular decomposition. It's called the eigendecomposition, or alternatively the spectral decomposition. It takes a (diagonalizable) matrix $A$ and decomposes it into a matrix product $A = Q \Lambda Q^{-1}$. Both $Q$ and $\Lambda$ have the following "nice" properties:

  • $Q$'s columns contain the eigenvectors of $A$.
  • $\Lambda$ is a diagonal matrix containing, for each eigenvector, the corresponding eigenvalue of $A$.
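For a concrete picture, here is a short sketch (assuming NumPy; the matrix below is just a small sample with distinct eigenvalues) that computes $Q$ and $\Lambda$ and checks the reconstruction $A = Q \Lambda Q^{-1}$:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])

# Eigendecomposition: columns of Q are eigenvectors, lam holds the eigenvalues.
lam, Q = np.linalg.eig(A)
Lam = np.diag(lam)

# Reconstruct A = Q Lam Q^{-1}. (For this non-symmetric A, Q is not orthogonal.)
print(np.allclose(Q @ Lam @ np.linalg.inv(Q), A))   # True
```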

Nice properties of Eigendecomposition

As mentioned previously, decompositions can have nicer properties if they have nice input. As it happens, we do have (very) nice input: $A$ is symmetric and real. Under those conditions, the result of eigendecomposition has the following extra "nice" properties:

  • $\Lambda$'s entries are all real numbers.
    • Therefore the eigenvalues of $A$ are real numbers.
  • $Q$ is orthogonal.
    • Therefore, $Q$'s columns, the eigenvectors, are all unit vectors that are orthogonal to each other. This makes the $Q^{-1}$ matrix easy to compute once you have $Q$: It's just $Q^{-1}=Q^{T}$.
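Here is a minimal check of these extra properties for a random symmetric real matrix (a sketch, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                 # symmetric, real

lam, Q = np.linalg.eigh(A)        # eigh is NumPy's routine for symmetric/Hermitian matrices

print(np.isrealobj(lam))                        # eigenvalues are real numbers
print(np.allclose(Q.T @ Q, np.eye(4)))          # Q is orthogonal, so Q^{-1} = Q^T
print(np.allclose(Q @ np.diag(lam) @ Q.T, A))   # A = Q Lambda Q^T
```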

Multiplying a vector by a matrix

Let's change tracks for a moment. Suppose now you have an $n$-row matrix $M$. When you carry out a matrix-vector multiplication $v' = Mv$, you compute $n$ dot-products: $v \cdot M_\textrm{row1}$, $v \cdot M_\textrm{row2}$, ..., $v \cdot M_\textrm{row$n$}$, which become the entries of a new $n$-element vector $v'$.

$$M = \left( \begin{array}{c} M_{\textrm{row$_1$}} \\ M_{\textrm{row$_2$}} \\ \vdots \\ M_{\textrm{row$_n$}} \end{array} \right)$$ $$v' = Mv = \left( \begin{array}{c} M_{\textrm{row$_1$}} \\ M_{\textrm{row$_2$}} \\ \vdots \\ M_{\textrm{row$_n$}} \end{array} \right) v = \left( \begin{array}{c} M_{\textrm{row$_1$}} \cdot v \\ M_{\textrm{row$_2$}} \cdot v \\ \vdots \\ M_{\textrm{row$_n$}} \cdot v \end{array} \right)$$

As you know, $a \cdot b$, where $b$ is a unit vector, calculates the projection of $a$ onto $b$; in other words, how much they overlap or shadow each other.

Therefore, by doing that matrix-vector multiplication $Mv$, you've re-expressed $v$ in terms of its $n$ projections onto the $n$ row vectors of $M$. You can think of that as a re-encoding/re-expression/compression of $v$ of sorts. The word "compression" is especially apt, since some information may be lost; this will be the case if $M$ has fewer rows than columns, or if its rows aren't linearly independent of each other.
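To see the "one dot product per row" picture explicitly, here is a small sketch (assuming NumPy; the matrix and vector are arbitrary examples):

```python
import numpy as np

M = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 4.0]])
v = np.array([1.0, 0.5, -1.0])

# The matrix-vector product...
v_prime = M @ v

# ...is exactly one dot product per row of M, stacked into a vector.
by_rows = np.array([row @ v for row in M])

print(v_prime, by_rows)                  # identical
print(np.allclose(v_prime, by_rows))     # True
```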

Multiplying a vector by an orthogonal matrix

But what if $M$ is orthogonal? Then all its rows are orthogonal to each other, and if so, it's actually possible to not lose any information at all when doing $v' = Mv$, and it's possible to recover the original vector $v$ from $v'$ and $M$!

Indeed, take an $n \times n$ orthogonal matrix $M$ and an $n$-element vector $v$, and compute $v' = Mv$. You'll compute the $n$ projections of $v$ onto the $n$ rows of $M$, and because these rows of $M$ are orthonormal (fully independent of each other), the projections of $v$ (stored as the elements of $v'$) contain no redundant information between themselves, and therefore nothing had to be pushed out or damaged to make space.

And because you've losslessly encoded the vector $v$ as a vector $v'$ of projections onto the rows of $M$, it's possible to recreate $v$, by doing the reverse: Multiplying the rows of $M$ by the projections in $v'$, and summing them up!

To do that, we must transpose $M$, since we're left-multiplying $v'$ by $M$. Whereas we previously had the rows of $M$ where we conveniently wanted them (as rows) to do the encoding, now we must have them as the columns in order to do the decoding of $v'$ into $v$. Whence we get

$$v = M^T v'$$ $$v = M^T (Mv)$$ $$v = (M^T M)v$$ $$v = v$$

As an aside, this is the reason why orthogonal matrices $Q$ have the property

$$I = QQ^T = Q^T Q$$

And hence why

$$Q^{-1}=Q^T.$$

You'll recall I pointed out this property earlier on.
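Here is that encode/decode round trip in a few lines (a sketch, assuming NumPy; the orthogonal matrix comes from a QR factorization and is just an example):

```python
import numpy as np

rng = np.random.default_rng(3)
M, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # a sample orthogonal matrix
v = rng.standard_normal(3)

v_enc = M @ v          # encode: projections of v onto the rows of M
v_dec = M.T @ v_enc    # decode: M^T undoes the encoding

print(np.allclose(v_dec, v))               # True: nothing was lost
print(np.allclose(M @ M.T, np.eye(3)))     # I = M M^T
print(np.allclose(M.T @ M, np.eye(3)))     # I = M^T M
```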

Recap

Given an orthogonal matrix $M$ and vector $v$:

Encoding $v \to v'$ as projections onto rows of $M$ ($v' = Mv$): $$M = \left( \begin{array}{c} M_{\textrm{row$_1$}} \\ M_{\textrm{row$_2$}} \\ \vdots \\ M_{\textrm{row$_n$}} \end{array} \right)$$ $$v' = Mv = \left( \begin{array}{c} M_{\textrm{row$_1$}} \\ M_{\textrm{row$_2$}} \\ \vdots \\ M_{\textrm{row$_n$}} \end{array} \right) v = \left( \begin{array}{c} M_{\textrm{row$_1$}} \cdot v \\ M_{\textrm{row$_2$}} \cdot v \\ \vdots \\ M_{\textrm{row$_n$}} \cdot v \end{array} \right)$$

Decoding $v' \to v$ by multiplying the rows of $M$ by the projections onto them, and summing up ($v = M^{T}v'$): $$M^T = \left( \begin{array}{cccc} M_{\textrm{row$_1$}}^T & M_{\textrm{row$_2$}}^T & \cdots & M_{\textrm{row$_n$}}^T \end{array} \right)$$ $$v = M^{T}v' = \left( \begin{array}{cccc} M_{\textrm{row$_1$}}^T & M_{\textrm{row$_2$}}^T & \cdots & M_{\textrm{row$_n$}}^T \end{array} \right) v'$$ $$= \left( \begin{array}{cccc} M_{\textrm{row$_1$}}^T & M_{\textrm{row$_2$}}^T & \cdots & M_{\textrm{row$_n$}}^T \end{array} \right) \left( \begin{array}{c} M_{\textrm{row$_1$}} \cdot v \\ M_{\textrm{row$_2$}} \cdot v \\ \vdots \\ M_{\textrm{row$_n$}} \cdot v \end{array} \right)$$ $$= (M_{\textrm{row$_1$}} \cdot v)M_{\textrm{row$_1$}}^T + (M_{\textrm{row$_2$}} \cdot v)M_{\textrm{row$_2$}}^T + \cdots + (M_{\textrm{row$_n$}} \cdot v)M_{\textrm{row$_n$}}^T$$ $$=v$$
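The "multiply each row by its projection and sum up" view of the decoding step, written out as a sketch (assuming NumPy; the orthogonal matrix is again a sample from a QR factorization):

```python
import numpy as np

rng = np.random.default_rng(4)
M, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # orthogonal, so its rows are orthonormal
v = rng.standard_normal(3)

# Decode by hand: sum over rows of (projection onto the row) * (the row itself).
v_rebuilt = sum((row @ v) * row for row in M)

print(np.allclose(v_rebuilt, v))               # True
print(np.allclose(v_rebuilt, M.T @ (M @ v)))   # same thing as M^T (M v)
```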

Multiplying a vector by an eigendecomposed matrix

We now get to the crux of my argument. Suppose now we don't treat that matrix $A$ from so long ago as a black box, but instead look under the hood, at its eigendecomposition $A = Q\Lambda Q^{-1}$, or in this particular case $A = Q\Lambda Q^{T}$. See those orthogonal $Q$'s sandwiching a diagonal matrix? Well, $Q$ has the eigenvectors in its columns, so $Q^T$ will have them in its rows. We've seen how a $Q$-$Q^T$ or $Q^T$-$Q$ sandwich essentially encodes/decodes, and here we're encoding/decoding over $A$'s eigenvectors. The only twist is this extra $\Lambda$ matrix!

What it effectively does is that after encoding but before decoding, it scales each of the components of the encoded vector independently, and then hands off to the decoding matrix.

To maximize $v^T A v = v^T Q \Lambda Q^T v$, our goal is therefore to choose $v$ such that when it is encoded, all of its energy lands in the component that is then scaled by the largest eigenvalue in $\Lambda$! And how do we achieve that? By choosing $v$ to be the eigenvector corresponding to that eigenvalue! This maximizes the value of the dot product for that component in the encoding step, and this maximally large value gets scaled by the largest eigenvalue we have access to. If we fail to align 100% of the energy of $v$ with the eigenvector of greatest eigenvalue, then when $v$ is encoded, some of that energy will bleed out into other components, be multiplied by a lesser eigenvalue, and the result won't be the maximum possible anymore.
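To see this "energy bleeding" numerically, here is a minimal sketch (assuming NumPy; the symmetric matrix is random, and the partially aligned vector is just one illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(5)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2

lam, Q = np.linalg.eigh(A)           # eigenvalues in ascending order
v_best = Q[:, -1]                    # eigenvector of the largest eigenvalue

print(v_best @ A @ v_best, lam[-1])  # equal: all the 'energy' hits the top eigenvalue

# A unit vector that is only partially aligned does worse: half its energy
# is scaled by the smallest eigenvalue instead.
v_other = (Q[:, -1] + Q[:, 0]) / np.sqrt(2)
print(v_other @ A @ v_other, (lam[-1] + lam[0]) / 2)   # equal, and < lam[-1]
```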

Example

$$A = Q\Lambda Q^T$$ $$A = \left(\begin{array}{ccc} \vec e_1 & \vec e_2 & \vec e_3 \end{array}\right) \left(\begin{array}{ccc} \sigma_1 & 0 & 0 \\ 0 & \sigma_2 & 0 \\ 0 & 0 & \sigma_3 \end{array}\right) \left(\begin{array}{c} \vec e_1^T \\ \vec e_2^T \\ \vec e_3^T \end{array}\right)$$

Suppose $\sigma_1 = 2$, $\sigma_2 = 5$, $\sigma_3 = 4$. Then the largest eigenvalue is $\sigma_2$, and we want to have as much as possible (in fact, all) of our unit vector to be scaled by $\sigma_2$. How do we do that? Well, we choose the unit vector parallel to $\vec e_2$! Thereafter we get

$$v^T A v = v^T Q \Lambda Q^T v$$ With $v = \vec e_2$, we have $$= \vec e_2^T \left(\begin{array}{ccc} \vec e_1 & \vec e_2 & \vec e_3 \end{array}\right) \left(\begin{array}{ccc} 2 & 0 & 0 \\ 0 & 5 & 0 \\ 0 & 0 & 4 \end{array}\right) \left(\begin{array}{c} \vec e_1^T \\ \vec e_2^T \\ \vec e_3^T \end{array}\right) \vec e_2$$ $$= \left(\begin{array}{ccc} \vec e_2 \cdot \vec e_1 & \vec e_2 \cdot \vec e_2 & \vec e_2 \cdot \vec e_3 \end{array}\right) \left(\begin{array}{ccc} 2 & 0 & 0 \\ 0 & 5 & 0 \\ 0 & 0 & 4 \end{array}\right) \left(\begin{array}{c} \vec e_1 \cdot \vec e_2 \\ \vec e_2 \cdot \vec e_2 \\ \vec e_3 \cdot \vec e_2 \end{array}\right)$$ $$= \left(\begin{array}{ccc} 0 & 1 & 0 \end{array}\right) \left(\begin{array}{ccc} 2 & 0 & 0 \\ 0 & 5 & 0 \\ 0 & 0 & 4 \end{array}\right) \left(\begin{array}{c} 0 \\ 1 \\ 0 \end{array}\right)$$

We're successful! Because our choice of $v$ is parallel to $\vec e_2$ (the 2nd eigenvector), their dot product was a perfect 1, and so 100% of the energy of our vector went into the second component, the one that will be multiplied by $\sigma_\textrm{max} = \sigma_2$! We continue:

$$= \left(\begin{array}{ccc} 0 & 1 & 0 \end{array}\right) \left(\begin{array}{c} 0 \\ 5 \\ 0 \end{array}\right)$$ $$= 5 = \sigma_2$$

We've achieved the maximum gain $G$ possible, $G=\sigma_2=5$!

A similar logic can be applied to the case of the minimum eigenvalue.
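Here is that worked example checked numerically (a sketch, assuming NumPy; the orthonormal vectors standing in for $\vec e_1, \vec e_2, \vec e_3$ come from a QR factorization):

```python
import numpy as np

rng = np.random.default_rng(6)
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # columns play the role of e_1, e_2, e_3
Lam = np.diag([2.0, 5.0, 4.0])                     # sigma_1, sigma_2, sigma_3
A = Q @ Lam @ Q.T

e1, e2 = Q[:, 0], Q[:, 1]
print(e2 @ A @ e2)   # ~5.0: the maximum, attained at the eigenvector of sigma_2
print(e1 @ A @ e1)   # ~2.0: the minimum, attained at the eigenvector of sigma_1
```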


If you really don't like that diagonalization argument (which is correct), there is another one using Lagrange multipliers (which is quite interesting in its own right).

Let $g(v) = \|v\|^2 -1$. You want to maximize/minimize $f(v) = v^TAv$ subject to $g(v) = 0$. So by the method of Lagrange multipliers, there is a $\lambda$ such that

$$\nabla f = \lambda \nabla g.$$

Note $\nabla g(v) = 2v$. On the other hand, we have $\nabla f = 2Av$ as

$$\nabla f = A v+ (v^T A)^T = Av + A^T v = 2Av$$

as $A$ is symmetric. Thus we have $Av = \lambda v$, and so $v$ is an eigenvector. Plugging this into $f$, we get

$$f(v) = v^TAv = v^T(\lambda v) = \lambda v^Tv = \lambda$$

since $\|v\|=1$. So the maximum/minimum value of $f$ is an eigenvalue of $A$, and since every eigenvalue is attained by plugging in one of its unit eigenvectors, the maximum is the largest eigenvalue and the minimum is the smallest.
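If you want to convince yourself of the gradient formula numerically, here is a minimal sketch (assuming NumPy; the finite-difference check is just a generic verification, not part of the argument):

```python
import numpy as np

rng = np.random.default_rng(7)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                 # symmetric
v = rng.standard_normal(4)

f = lambda v: v @ A @ v

# Central finite-difference gradient of f at v, one coordinate at a time.
eps = 1e-6
grad_fd = np.array([(f(v + eps * e) - f(v - eps * e)) / (2 * eps)
                    for e in np.eye(4)])

print(np.allclose(grad_fd, 2 * A @ v, atol=1e-5))   # grad f = 2 A v for symmetric A
```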


I'm going to attempt to explain it on as intuitive a level as possible, with reference to the theory; if you want to go deeper into a specific aspect, you can read more about it. My approach is based on the linear algebra behind the matrices and vectors, i.e. $A$ represents a linear transformation. Since it is symmetric it has a very special property, which is best explained by decomposing it into three simpler matrices (linear operators). This is the fact, as mentioned before here, that $A=PDP^T$.

What does this mean in linear algebra terms? The matrix $P^T$ represents an orthogonal operator, that is, if you transform a vector with $P^T$ it does not change the vector's length. In 2-dimensional space $P^T$ is either a rotation or a reflection about a line through the origin; in higher dimensions these notions are generalized and $P^T$ could still be a rotation, but the non-rotational orthogonal transformations are more complex than simple "reflections". Nonetheless, the key is that $P^T$ does not change the length of the vector. Essentially what $P^T$ does is take the vector $v$ and regard it in a new axis system (a change of basis) instead of the standard one. It is still a very "nice" axis system in that the principal axes are perpendicular.

Then $D$ goes to work, scaling/distorting the transformed vector along each axis of this new coordinate system by a fixed quantity; the scaling along each axis is the corresponding diagonal entry of $D$. Finally the transformation $P$ reverses the effect of $P^T$, restoring the transformed vector to the original coordinate system.

So now, what does $v^TAv$ mean? First, $v^Tv$ is simply $\|v\|^2$, the norm of $v$ squared. If $v$ is a unit vector, then this will be 1. But if we now have $v^TAv$, it means we first transform $v$ by $A$ and then take the inner product of the new vector $Av$ with $v$. If $v$ falls on the axis in the new coordinate system induced by $P^T$ that corresponds to the biggest scaling factor in $D$, then $v^T(Av)$ is maximized since $D$'s effect is maximized; a similar argument holds for the minimum.

Also, if $v$ coincides with an axis of the new coordinate system, the orientation of $v$ is not changed by $D$ (since the other scaling factors in $D$ act perpendicular to the axis on which $v$ falls). This implies that the inner product of $Av$ with $v$ is maximized, since the inner product between two vectors of fixed lengths is maximized when they coincide (i.e. point in the same direction).
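A tiny 2-D illustration of this "rotate, scale along the new axes, rotate back" picture (a sketch, assuming NumPy; the rotation angle and scaling factors are arbitrary choices):

```python
import numpy as np

theta = 0.3
P = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # rotation, an orthogonal matrix
D = np.diag([4.0, 1.0])                            # scaling along the new axes
A = P @ D @ P.T                                    # symmetric

# The first column of P is the axis with the biggest scaling factor.
v = P[:, 0]
print(np.allclose(A @ v, 4.0 * v))   # A keeps v's direction, just scales it by 4
print(v @ A @ v)                     # 4.0, the maximum of v^T A v over unit vectors
```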

I hope this helps...