What's the point of orthogonal diagonalisation?
Solution 1:
When you have a matrix $A$ that is diagonalisable, a matrix $U$ such that $U^{-1}AU=D$ is diagonal is precisely a matrix whose columns form a basis of eigenvectors of $A$. Having a basis of eigenvectors makes understanding $A$ much easier, and makes computations with $A$ easy as well.
What the basis of eigenvectors does not do, however, is preserve the usual notions of size of vectors, distance between vectors, or angles between vectors that come from the inner/dot product. The standard basis for $\mathbb{R}^n$ is particularly nice because it is not only a basis, it is an orthonormal basis. Orthonormal bases make all sorts of things easy to compute, such as how to express vectors as linear combinations of them, least squares, the size of vectors, angles between vectors, etc. For example, if you think about it, it's very easy to express a vector as a linear combination of the vectors in the standard basis; it's usually not so easy to do it in terms of some other basis. With an orthonormal basis, all you need to do is take the dot product with the basis elements to get the coefficients of the linear combination. Very easy.
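For concreteness, here is a minimal sketch of that last point (assuming NumPy; the basis and the vector are arbitrary choices for the example):

```python
# With an orthonormal basis, the coefficients of a vector are just dot products
# with the basis vectors.
import numpy as np

# An orthonormal basis of R^2 other than the standard one (a 45-degree rotation):
u1 = np.array([1.0, 1.0]) / np.sqrt(2)
u2 = np.array([-1.0, 1.0]) / np.sqrt(2)

v = np.array([3.0, 1.0])
c1, c2 = np.dot(v, u1), np.dot(v, u2)      # coefficients via dot products
print(np.allclose(c1 * u1 + c2 * u2, v))   # True: v = c1*u1 + c2*u2
```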
So now you have two notions that make life simpler: a basis of eigenvectors makes your life easy relative to the linear transformation/basis; an orthonormal basis makes your life easy relative to the vector space itself. Unfortunately, usually you have to choose one or the other, and you cannot have both.
When a matrix $P$ is orthogonal, then its columns are not just a basis for $\mathbb{R}^n$, they are an orthonormal basis. The fact that $A$ is orthogonally diagonalisable means that, for $A$, you don't have to choose! The columns of $P$ are both an orthonormal basis, and a basis of eigenvectors for $A$. So you get the best of both worlds. And being an orthonormal basis, you still keep a good handle on the notions of distance, angles between vectors, and so on.
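Here is a small numerical sketch of this (assuming NumPy and an arbitrarily chosen symmetric matrix $A$): the columns of $P$ returned by `eigh` are simultaneously an orthonormal basis and a basis of eigenvectors, and coordinates taken in that basis preserve lengths.

```python
# Orthogonal diagonalisation of a symmetric example matrix.
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])               # symmetric, hence orthogonally diagonalisable

eigvals, P = np.linalg.eigh(A)           # eigh returns orthonormal eigenvectors for symmetric input
D = np.diag(eigvals)

print(np.allclose(P.T @ P, np.eye(2)))   # True: the columns of P are an orthonormal basis
print(np.allclose(P @ D @ P.T, A))       # True: A = P D P^T, so they are also eigenvectors of A

# Because the basis is orthonormal, changing coordinates preserves lengths (and angles):
x = np.array([3.0, -1.0])
coords = P.T @ x                         # coefficients of x in the eigenbasis (just dot products)
print(np.isclose(np.linalg.norm(coords), np.linalg.norm(x)))   # True
```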
Solution 2:
Mathematics is all about equivalence relations. Often we are not really interested in objects on-the-nose, we are interested in objects up to some natural equivalence relation. Matrices really define linear transformations, so in order to talk about linear transformations using matrices we need to specify when two matrices give the same abstract linear transformation. At the level of a first course in linear algebra, there are at least two natural choices, which are often not distinguished.
- Up to change of basis. This is a reasonable choice if your vector space has no natural norm or inner product (e.g. if it is a space of polynomials). The nicest matrices here are the ones that can be diagonalized (by an arbitrary invertible matrix).
- Up to change of orthonormal basis. This is a reasonable choice if your vector space has an inner product / Euclidean norm, since you want to be able to compute inner products in the easiest possible way with respect to your basis. Here the nicest matrices are the ones that can be orthogonally diagonalized.
One of the most important examples of the latter is the use of Fourier series, which orthogonally diagonalize differentiation. This is an important technique for solving differential equations.
Perhaps you are also not familiar with the point of diagonalization. The point of diagonalization is to change coordinates so that the linear transformation you're interested in is as simple as possible (it doesn't get simpler than diagonal matrices). That makes it easy to analyze, as in the Fourier series example above.
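A discrete analogue of the Fourier example is easy to check numerically. The sketch below (assuming NumPy; the circulant matrix stands in for a translation-invariant operator such as differentiation) shows that the unitary DFT diagonalizes circular-convolution matrices.

```python
# The Fourier basis diagonalises convolution: a circulant matrix becomes diagonal
# after a change of basis by the unitary DFT matrix.
import numpy as np

n = 6
rng = np.random.default_rng(0)
c = rng.random(n)

# Circulant matrix: (C x)_i = sum_j c[(i - j) % n] x[j], i.e. circular convolution with c.
C = np.array([[c[(i - j) % n] for j in range(n)] for i in range(n)])

# Unitary DFT matrix.
k = np.arange(n)
F = np.exp(-2j * np.pi * np.outer(k, k) / n) / np.sqrt(n)

M = F @ C @ F.conj().T                         # change of basis to the Fourier basis
print(np.allclose(M, np.diag(np.diag(M))))     # True: C is diagonal in the Fourier basis
```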
Solution 3:
Matrices are complicated objects. At first glance they are rectangular arrays of numbers with a complicated multiplication rule. Diagonalization helps us reduce the matrix multiplication operation to a sequence of simple steps that make sense. In your case you are asking about orthogonal diagonalization, so I will limit my comments to that. Note that I normally think about diagonalization as a factorization of the matrix $\mathbf A$ as
$$\mathbf A = \mathbf P \mathbf D \mathbf P^T$$
We know that $\mathbf D$ contains the eigenvalues of $\mathbf A$ and $\mathbf P$ contains the corresponding eigenvectors. Consider the multiplication $\mathbf{y=Ax}$. We can perform the multiplication $\mathbf y = \mathbf P \mathbf D \mathbf P^T \mathbf x$ step-by-step as follows:
First: $\mathbf x' = \mathbf P^T \mathbf x$. This step projects $\mathbf x$ onto the eigenvectors because the eigenvectors are in the rows of $\mathbf P^T$.
Second: $\mathbf y' = \mathbf D \mathbf x'$. This step stretches the resulting vector independently in each eigenvector direction. This is the key: in a suitable basis or representation, the matrix acts as a simple scaling in independent directions (in general a matrix may also shear or rotate; the directions are truly independent, i.e. orthogonal, only in the orthogonal case).
Third: $\mathbf y = \mathbf P \mathbf y'$. We take our resulting stretched vector and linearly combine the eigenvectors to get back to our original space.
So in summary, diagonalization tells us that, under matrix multiplication, a vector can be represented in a special basis in which the action of the matrix is a simple diagonal matrix (the simplest possible), and then mapped back to our original space.
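A short sketch of the three steps (assuming NumPy and an arbitrary symmetric example matrix):

```python
# Performing y = A x as the three steps y = P D P^T x.
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0]])
eigvals, P = np.linalg.eigh(A)     # orthonormal eigenvectors in the columns of P
D = np.diag(eigvals)

x = np.array([1.0, 2.0])
x_prime = P.T @ x                  # step 1: coordinates of x in the eigenbasis
y_prime = D @ x_prime              # step 2: independent scaling along each eigendirection
y = P @ y_prime                    # step 3: back to the original coordinates

print(np.allclose(y, A @ x))       # True: same result as multiplying by A directly
```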
Solution 4:
Arturo has given a nice answer. Here are a couple of additions to Arturo's answer.
Any real symmetric matrix can be diagonalized, and in fact can be written in orthogonally diagonalized form. Another aspect of this orthogonal diagonalization is that it essentially means the left and right singular vectors are the same (except possibly for a change of sign). The singular values of such a matrix are nothing but the magnitudes of its eigenvalues. So in essence, the orthogonal diagonalization gives the Singular Value Decomposition as well, and knowing the SVD is all you need to know about a matrix.
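A quick numerical sketch of this relationship (assuming NumPy and an arbitrarily chosen symmetric indefinite matrix with distinct singular values, so that the pairing of singular vectors is unique):

```python
# Eigen-decomposition vs. SVD of a symmetric matrix.
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, -1.0]])                  # symmetric, indefinite, distinct singular values

eigvals, P = np.linalg.eigh(A)
U, s, Vt = np.linalg.svd(A)

# Singular values are the magnitudes of the eigenvalues:
print(np.allclose(np.sort(s), np.sort(np.abs(eigvals))))   # True

# Each left singular vector equals the matching right singular vector up to sign:
for u, v in zip(U.T, Vt):
    print(np.allclose(u, v) or np.allclose(u, -v))         # True for every pair
```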
Solution 5:
If a matrix $A$ is unitarily diagonalizable, then one can define a "Fourier transform" for which $A$ is a "convolution" matrix.
Here is an example. We have a family $F$ of subsets of some finite set $S$, i.e. $F \subset 2^S$, such that any two sets in $F$ agree in some coordinate, i.e. for any two $A,B \in F$, there is some element $x \in S$ that either belongs to both $A,B$ or to neither; we call such a family $F$ agreeing.
How big can an agreeing family $F$ be? Clearly, it can contain at most half the sets, since $A$ and $S \setminus A$ cannot both belong to $F$. On the other hand, for any $x \in S$, the family $$F = \{ A \subset S : x \in A \}$$ is an agreeing family containing exactly half the sets.
Here is a different proof. Let $F$ be an agreeing family, and $f$ its characteristic function, i.e. $f(A) = 1$ iff $A \in F$. Its Fourier transform is defined by $$ \hat{f}(B) = 2^{-|S|} \sum_A f(A) (-1)^{|A \cap B|}. $$ The Fourier transform is really presenting the function $f$ in the orthonormal basis $$\chi_B(A) = (-1)^{|A\cap B|},$$ which is orthonormal with respect to the inner product $$\langle g,h \rangle = 2^{-|S|} \sum_A g(A) h(A). $$ The Fourier transform is defined so that $$f = \sum_B \hat{f}(B) \chi_B, $$ and so the formula for the transform can also be written $$\hat{f}(B) = \langle f,\chi_B \rangle;$$ this works since the basis is orthonormal.
An easy calculation gives that $\hat{f}(\emptyset) = |F|/2^{|S|}$. Moreover, $$\langle f,f \rangle = \sum_{B,C} \langle \hat{f}(B) \chi_B, \hat{f}(C) \chi_C \rangle = \sum_B \hat{f}(B)^2,$$ again from orthonormality. Note that $$\langle f,f\rangle = \langle f^2, 1 \rangle = \langle f,\chi_\emptyset \rangle = \hat{f}(\emptyset),$$ where we used the fact that $f$ is $\{0,1\}$-valued.
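These identities are easy to check numerically. Here is a small sketch (assuming Python/NumPy, a four-element set $S$, and the half-sized family $F = \{ A : x \in A \}$ from before):

```python
# Fourier transform on subsets of S, checked on the "dictatorship" family F = {A : 0 in A}.
import itertools
import numpy as np

S = range(4)                                   # |S| = 4, so 2^|S| = 16 subsets
subsets = [frozenset(c) for r in range(5) for c in itertools.combinations(S, r)]

F = [A for A in subsets if 0 in A]             # agreeing family fixing the element x = 0
f = {A: 1.0 if A in F else 0.0 for A in subsets}

def fourier(f, B):
    """hat{f}(B) = 2^{-|S|} * sum_A f(A) (-1)^{|A n B|}."""
    return sum(f[A] * (-1) ** len(A & B) for A in subsets) / len(subsets)

fhat = {B: fourier(f, B) for B in subsets}

print(np.isclose(fhat[frozenset()], len(F) / len(subsets)))      # hat{f}(empty) = |F| / 2^|S|
inner_ff = sum(f[A] ** 2 for A in subsets) / len(subsets)        # <f, f>
print(np.isclose(inner_ff, sum(v ** 2 for v in fhat.values())))  # Parseval: <f,f> = sum of hat{f}(B)^2
print(np.isclose(inner_ff, fhat[frozenset()]))                   # <f,f> = hat{f}(empty) for 0/1-valued f
```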
Consider now the operator $X$ which corresponds to complementation, i.e. $$Xe_A = e_{S \setminus A},$$ where $e_A$ is the vector which is $1$ only for the set $A$. Another way to look at the operator $X$ is that it is convolution with $S$, where the group operation is symmetric difference (i.e. $A \triangle S = S \setminus A$).
A straightforward computation shows that the eigenvectors of $X$ are exactly the Fourier basis vectors $\chi_B$: $$ X \chi_B(A) = (-1)^{|(S\setminus A)\cap B|} = (-1)^{|S \cap B|} (-1)^{|A \cap B|} = (-1)^{|B|} \chi_B(A). $$ This is not surprising since $X$ is a convolution operator for the group $\mathbb{Z}_2^{|S|}$, and the Fourier basis is a basis of characters for that abelian group.
Since the family $F$ is agreeing, $f(A)f(S\setminus A) = 0$ for every $A$, and so $$0 = \langle f,Xf \rangle = \sum_B (-1)^{|B|} \hat{f}(B)^2,$$ where we again used orthonormality of the Fourier basis. Denoting $|f| = |F|/2^{|S|}$, we have seen above that $$\sum_B \hat{f}(B)^2 = |f|, \quad \hat{f}(\emptyset)^2 = |f|^2.$$ So the squared Fourier coefficients over even-sized and odd-sized $B$ have equal total weight; the grand total is $|f|$, and the even part already contains the single term $\hat{f}(\emptyset)^2 = |f|^2$. Since the even part is only half the total, $|f|^2 \leq |f|/2$, and so $|f| \leq 1/2$.
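The whole computation can be verified on a small example. The sketch below (assuming Python/NumPy, $|S| = 4$, and the agreeing family fixing the element $0$) builds the complementation operator $X$, checks that each $\chi_B$ is an eigenvector with eigenvalue $(-1)^{|B|}$, and confirms that $\langle f, Xf \rangle = 0$ and $|f| \leq 1/2$:

```python
# Complementation operator X on functions on 2^S, and the agreeing-family bound.
import itertools
import numpy as np

S = frozenset(range(4))
subsets = [frozenset(c) for r in range(5) for c in itertools.combinations(S, r)]
N = len(subsets)                                            # 2^{|S|}

def chi(B):
    """Fourier basis vector chi_B as an array indexed by the subsets."""
    return np.array([(-1) ** len(A & B) for A in subsets], dtype=float)

# Complementation operator X: (X g)(A) = g(S \ A).
index = {A: i for i, A in enumerate(subsets)}
X = np.zeros((N, N))
for A in subsets:
    X[index[A], index[S - A]] = 1.0

# Each chi_B is an eigenvector of X with eigenvalue (-1)^{|B|}:
for B in subsets:
    assert np.allclose(X @ chi(B), (-1) ** len(B) * chi(B))

# For an agreeing family F (here: all sets containing the element 0),
# <f, Xf> = 0, and the argument above gives |F| / 2^{|S|} <= 1/2.
f = np.array([1.0 if 0 in A else 0.0 for A in subsets])
inner = lambda g, h: np.dot(g, h) / N
print(np.isclose(inner(f, X @ f), 0.0))                     # f(A) f(S\A) = 0 for all A
print(inner(f, np.ones(N)) <= 0.5)                          # |f| = |F|/2^{|S|} <= 1/2
```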
This proof might seem silly (since we presented a "one-line" proof preceding it), but the same method can be used to prove much more difficult theorems. For example, we can look at a family of "colored" sets, i.e. generalize the two colors above (corresponding to "not in the set" and "in the set") to an arbitrary number of colors. The largest family is still obtained by fixing one coordinate, but the combinatorial proof is more difficult (takes about a page); the Fourier proof is almost the same.
Here are some more difficult examples:
- Families of graphs over a fixed vertex set, the intersection of any two of which contains a triangle. The "best" family is obtained by taking all supergraphs of a fixed triangle. The only known proof is using very similar Fourier methods.
- Families of permutations, any two of which have a common "input/output" pair. The "best" family is in general obtained by fixing the image of some element (all permutations taking $i$ to $j$). There is a simple combinatorial proof of that along the lines above. But it's much harder to prove that these are the unique optimal families, whereas it follows relatively easily from the "Fourier" proof. Moreover, the latter can be extended to the case where the permutations are required to have $t$ matching input/output pairs. The "best" families take $t$ fixed inputs to $t$ fixed outputs. The only known proof is spectral (it requires the representation theory of $S_n$).
- Families of subsets of $S$ of size $k < |S|/2$, any two of which intersect. The maximal families are the same as above (all $k$-sets containing a fixed element). This is the celebrated Erdős-Ko-Rado theorem. There is a Fourier proof with some extra benefits over some of the combinatorial proofs, viz. it can describe the structure of almost optimal families.
The Fourier proof of the last example actually bounds a skewed measure of a general family of subsets: instead of restricting the sets to have size $k$, we just give more "relevance" to sets of size roughly $k$. The proof goes as follows:
- Find some inner product so that $\langle f,1 \rangle_p$ is the required skewed measure (which is $\mu_p(A) = p^{|A|} (1-p)^{|S\setminus A|}$ for $p \approx k/n$).
- Find some "convolution operator" $X$ such that $\langle f,Xf \rangle_p = 0$ for every intersecting family, and moreover the eigenvectors of $X$ are orthogonal with respect to the inner product.
- Follow the same steps as above to conclude that $\mu_p(f) \leq p$.
The construction of the convolution operator $X$ crucially uses the fact that symmetric matrices are unitarily diagonalizable (the symmetric matrix in question is obtained from a reversible Markov chain). The operator $X$ is not strictly unitarily diagonalizable, since its eigenvectors are only orthogonal with respect to the skewed inner product (in which the stationary distribution of the Markov chain crops up), but it is obtained from such a matrix through scaling of rows.
The material is taken from a series of papers by Ehud Friedgut and friends. I will refer to them by their current numbers on his list.
- The general method (including the bound on "agreeing families of colored sets") is #16.
- The application to permutations is #2 (there are some follow-up papers by David Ellis).
- The application to graphs is #1.
- The spectral proof of Erdős-Ko-Rado is #7.
- The general construction (using the crucial property that symmetric matrices are unitarily diagonalizable) is #6.
- The connection between #7 and #6 is explained in the last section of #5.