Let $C \in \mathbb{R}^{d \times d}$ be symmetric, and

$$Q = \begin{bmatrix} \vert & \vert & & \vert \\ q_1 & q_2 & \dots & q_K \\ \vert & \vert & & \vert \end{bmatrix} \in \mathbb{R}^{d\times K}$$

where $d \geq K$. Using Lagrange multipliers,

$$\begin{array}{ll} \text{maximize} & \mbox{tr} \left( Q^T C Q \right)\\ \text{subject to} & Q^T Q = I\end{array}$$


I am unfamiliar with this kind of constraint in this method, but after reading another post I believe the same simple result given there is also applicable here, and therefore the Lagrangian would be:

$$\mathcal{L}(Q,\lambda)=\mathrm{tr}(Q^TCQ)-\left<\lambda,Q^TQ-I\right>$$

where $\lambda\in\mathbb{R}^{K\times K}$, and $\left<\cdot,\cdot\right>$ is the element-wise (Frobenius) inner product (which makes sense to me, since we are actually adding as many constraints as there are elements in these matrices).

In attempting to do that I start by setting $\frac{\partial \mathcal{L}}{\partial Q}=O\in\mathbb{R}^{d\times K}$ and computing the LHS element by element; for the $(l,m)$ entry:

\begin{equation} 0=\frac{\partial \mathcal{L}}{\partial Q_{lm}}=(CQ+C^TQ)_{lm}-\underbrace{\frac{\partial}{\partial Q_{lm}}\sum_{i,j}\lambda_{i,j}(Q^TQ-I)_{ij}}_{=\lambda_{lm}\frac{\partial (Q^TQ)_{lm}}{\partial Q_{lm}}}=2(CQ)_{lm}-\lambda_{lm}\frac{\partial (q_l^Tq_m)}{\partial q_m(l)} \tag{1}\end{equation}

where in the last step I've used the definition I made at the beginning for $Q$, and $q_m(l)$ denotes the $l$-th component of the column vector $q_m$.

In trying to compute the very last term: $$\frac{\partial (q_l^Tq_m)}{\partial q_m(l)}=\frac{\partial \left[q_l(1)q_m(1)+ \ldots + q_l(d)q_m(d)\right]}{\partial q_m(l)}= \begin{cases} q_l(l)\equiv Q_{ll} & \text{if } l\neq m\\ 2q_l(l)\equiv 2Q_{ll} & \text{if } l=m \end{cases}$$

The whole equality (1) then can be written:

$$0=2(CQ)_{lm}-\lambda_{lm}Q_{ll}(1+\delta_{lm})$$

where $\delta_{lm}$ is the Kronecker delta.

The other stationarity condition of the Lagrangian, $\frac{\partial \mathcal{L}}{\partial \lambda}=O\in\mathbb{R}^{K\times K}$, again for the $(l,m)$ element:

$$ 0=\frac{\partial \mathcal L}{\partial \lambda_{lm}}= \frac{\partial }{\partial \lambda_{lm}}\sum_{i,j}\lambda_{i,j}(Q^TQ-I)_{ij}=(Q^TQ-I)_{lm}\tag{2}$$

which leads back to the constraint, $(Q^TQ)_{lm}=\delta_{lm}$.

All this should show that the columns of $Q$ are ultimately the first $K$ eigenvectors of $C$, but I don't know how to continue from here to prove that, supposing I haven't made a mistake. I would sincerely appreciate any help.
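For reference, the expected result (maximum value $\lambda_1+\cdots+\lambda_K$, attained when the columns of $Q$ are the first $K$ eigenvectors) can at least be checked numerically. A minimal sketch with numpy (the names `objective`, `Q_top`, `best_random` are illustrative, not from the derivation above):

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 6, 3

# A random symmetric C
A = rng.standard_normal((d, d))
C = (A + A.T) / 2

def objective(Q):
    """The quantity tr(Q^T C Q) being maximized."""
    return np.trace(Q.T @ C @ Q)

# Candidate maximizer: eigenvectors of the K largest eigenvalues
w, U = np.linalg.eigh(C)          # eigenvalues in ascending order
Q_top = U[:, ::-1][:, :K]

# Compare against many random feasible Q (orthonormalized via QR)
best_random = max(
    objective(np.linalg.qr(rng.standard_normal((d, K)))[0])
    for _ in range(2000)
)

assert best_random <= objective(Q_top) + 1e-9
assert np.isclose(objective(Q_top), w[::-1][:K].sum())
```

Every random feasible $Q$ scores at most the sum of the three largest eigenvalues, and the top-eigenvector choice attains that value exactly.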


Edit:

I have rewritten the inner product as a trace of a product of matrices (after seeing this question):

$$\left<\lambda,Q^TQ-I\right>=\sum_{i,j}\lambda_{i,j}(Q^TQ-I)_{ij}=\mathrm{tr}\left(\lambda^T(Q^TQ-I)\right) $$

and have thus managed to do the derivative without losing the matrix format (using formulas from the Matrix Cookbook):

\begin{align} O=&\frac{\partial \mathcal{L}}{\partial Q}=\frac{\partial}{\partial Q}\mathrm{tr}(Q^TCQ)-\frac{\partial}{\partial Q}\underbrace{\mathrm{tr}(\lambda^T(Q^TQ-I))}_{\mathrm{tr}(\lambda^TQ^TQ)-\mathrm{tr}(\lambda^T)}\\=&(CQ+C^TQ)-(Q(\lambda^T)^T+Q\lambda^T)=2CQ-Q(\lambda+\lambda^T) \end{align}

And this leads to:

$$CQ=Q\underbrace{\left(\frac{\lambda+\lambda^T}{2}\right)}_{:=\widetilde{\lambda}}$$

Multiplying on the left by $Q^T$ and using $Q^TQ=I$ gives $\widetilde{\lambda}=Q^TCQ$. If this matrix were diagonal, each column of $Q$ would be an eigenvector of $C$ and we would already have the result.
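This can be checked numerically at the candidate solution: taking the columns of $Q$ to be top eigenvectors of $C$, the relation $CQ=Q\widetilde{\lambda}$ holds with $\widetilde{\lambda}=Q^TCQ$ diagonal. A sketch assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(1)
d, K = 5, 2
A = rng.standard_normal((d, d))
C = (A + A.T) / 2

w, U = np.linalg.eigh(C)         # ascending eigenvalues
Q = U[:, ::-1][:, :K]            # top-K eigenvectors as columns

lam_tilde = Q.T @ C @ Q          # the matrix called lambda-tilde above

# Stationarity condition C Q = Q * lambda-tilde holds...
assert np.allclose(C @ Q, Q @ lam_tilde)
# ...and lambda-tilde is diagonal here, with the top K eigenvalues
assert np.allclose(lam_tilde, np.diag(w[::-1][:K]))
```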


Solution 1:

Since $C$ is real symmetric we can write $C=U \Lambda U^T$, where $\Lambda$ is a diagonal matrix of eigenvalues. Substituting $\tilde Q = U^T Q$, which is still feasible since $\tilde Q^T \tilde Q = Q^T U U^T Q = I$, we can just assume $C= \operatorname{diag} (\lambda_1,...,\lambda_d)$, where $\lambda_1 \ge \cdots \ge \lambda_d$.

The problem is then $\max_{Q^TQ=I} \operatorname{tr}(Q^T \Lambda Q)$.

Note that $\operatorname{tr}(Q^T \Lambda Q) = \operatorname{tr}(Q^T Q Q^T \Lambda Q) = \operatorname{tr}( Q Q^T \Lambda QQ^T) = \operatorname{tr}(P^T \Lambda P)$, where $P=Q Q^T$.

Note that $P$ is an orthogonal projection onto a subspace of dimension $K$. Furthermore, any such orthogonal projection can be written in the form $Q Q^T$, where $Q^TQ = I$.

So now the problem is $\max_{P \text{ orthogonal projection}, \text{ rk } P=K} \operatorname{tr}(P^T \Lambda P)$.

Note that $\operatorname{tr}(P^T \Lambda P) = \sum_{n=1}^d \lambda_n \|P e_n\|^2$. Furthermore, note that $\|P\|_F^2 = K$ and so $\sum_{n=1}^d \|P e_n\|^2 = K$ with $0 \le \|P e_n\|^2 \le 1$. ($e_n$ is the $n$th unit vector.)

It is straightforward to check that $\max\{ \sum_{n=1}^d \lambda_n \mu_n \mid \sum_{n=1}^d \mu_n = K,\ 0 \le \mu_n \le 1 \}$ is $\lambda_1+\cdots+ \lambda_K$ (put as much weight as possible on the largest $\lambda_n$).

Hence $\operatorname{tr}(P^T \Lambda P) \le \lambda_1+\cdots+ \lambda_K$, and by choosing $\operatorname{ran} P = \operatorname{span}\{e_1,...,e_K \}$ we see that this bound is attained.
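The key identities of this argument ($\operatorname{tr}(P^T \Lambda P) = \sum_n \lambda_n \|P e_n\|^2$, $\sum_n \|P e_n\|^2 = K$, and the resulting bound) can be spot-checked for a random rank-$K$ projection; a sketch assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(2)
d, K = 7, 3
lam = np.sort(rng.standard_normal(d))[::-1]      # lambda_1 >= ... >= lambda_d
Lam = np.diag(lam)

# Random rank-K orthogonal projection P = Q Q^T
Q = np.linalg.qr(rng.standard_normal((d, K)))[0]
P = Q @ Q.T

# mu_n = ||P e_n||^2 (P is symmetric, so P e_n is the n-th column)
mu = np.array([P[:, n] @ P[:, n] for n in range(d)])

assert np.isclose(np.trace(P.T @ Lam @ P), lam @ mu)  # tr = sum of lam_n * mu_n
assert np.isclose(mu.sum(), K)                        # ||P||_F^2 = K
assert lam @ mu <= lam[:K].sum() + 1e-12              # bounded by top-K sum
```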

Solution 2:

Let $B := C + \delta I$ for some $\delta \in \mathbb{R}$ large enough that the real symmetric matrix $B\succ0$.

Let $\Sigma_B$ be a diagonal matrix with the singular values of $B$ (which, since $B \succ 0$, are also its eigenvalues) and let $\Sigma_{QQ^T}$ have the singular values of $QQ^T$, both in the usual ordering from largest to smallest.
Note this means $\Sigma_{QQ^T} = \begin{bmatrix} \mathbf I_k & \mathbf 0 \\ \mathbf 0 & \mathbf 0 \end{bmatrix}$

by application of von Neumann trace inequality:
$\text{trace}\big(Q^TBQ\big)$
$=\text{trace}\big((QQ^T)B\big)$
$\leq \text{trace}\big(\Sigma_{QQ^T}\Sigma_{B}\big)$
$= \sum_{i=1}^k \sigma_i^{(B)}$
$= \sum_{i=1}^k \lambda_i^{(B)}$

Making use of linearity we also know
$\text{trace}\big(Q^TBQ\big) = \text{trace}\big(Q^T(C + \delta I)Q\big)= \text{trace}\big(Q^TC Q\big) + \delta\cdot \text{trace}\big( Q^TQ\big) = \text{trace}\big(Q^TC Q\big) + \delta \cdot k$

to conclude
$ \text{trace}\big(Q^TC Q\big) $
$= \text{trace}\big(Q^TBQ\big) -\delta \cdot k $
$\leq \big( \sum_{i=1}^k \lambda_i^{(B)}\big)-\delta \cdot k$
$= \big( \sum_{i=1}^k (\lambda_i^{(B)}-\delta)\big)$
$= \sum_{i=1}^k \lambda_i^{(C)}$

and this is met with equality when you select the columns of $Q$ to be the first $k$ (mutually orthonormal) eigenvectors of $B$, which are also eigenvectors of $C$.
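The shift argument is easy to test numerically: both the trace identity and the von Neumann bound hold for a random feasible $Q$ (a sketch assuming numpy; the particular choice of `delta` below is just one that guarantees $B\succ0$):

```python
import numpy as np

rng = np.random.default_rng(3)
d, k = 6, 2
A = rng.standard_normal((d, d))
C = (A + A.T) / 2

delta = abs(np.linalg.eigvalsh(C)).max() + 1.0   # shift making B positive definite
B = C + delta * np.eye(d)
assert np.all(np.linalg.eigvalsh(B) > 0)

Q = np.linalg.qr(rng.standard_normal((d, k)))[0]  # feasible: Q^T Q = I_k

# Linearity of the trace: tr(Q^T B Q) = tr(Q^T C Q) + delta * k
assert np.isclose(np.trace(Q.T @ B @ Q), np.trace(Q.T @ C @ Q) + delta * k)

# von Neumann bound: tr(Q^T B Q) <= sum of the k largest eigenvalues of B
lam_B = np.sort(np.linalg.eigvalsh(B))[::-1]
assert np.trace(Q.T @ B @ Q) <= lam_B[:k].sum() + 1e-12
```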

Solution 3:

Here's a proof using Cauchy eigenvalue interlacing.

Given that $Q^T Q = I_k$, the matrix $A:=Q^T C Q$ has $k$ eigenvalues that interlace with those of $C$. With eigenvalues in the usual ordering of
$\lambda_1^{(C)} \geq \lambda_2^{(C)} \geq ... \geq \lambda_n^{(C)}$ and $\lambda_1^{(A)} \geq \lambda_2^{(A)} \geq ... \geq \lambda_k^{(A)}$
A crude consequence of Cauchy Interlacing is that
$\lambda_j^{(C)} \geq \lambda_j^{(A)}$ for $j\in\{1,2,...,k\}$

Summing over the bound,
$\sum_{j=1}^k \lambda_j^{(C)} \geq \sum_{j=1}^k\lambda_j^{(A)} = \text{trace}\big(Q^T C Q\big)$
The upper bound is met with equality when the columns of $Q$ are chosen to be the first $k$ eigenvectors of $C$.
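The crude interlacing consequence $\lambda_j^{(C)} \geq \lambda_j^{(A)}$ and the resulting trace bound can be spot-checked for a random compression (a sketch assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 8, 3
M = rng.standard_normal((n, n))
C = (M + M.T) / 2

Q = np.linalg.qr(rng.standard_normal((n, k)))[0]   # Q^T Q = I_k
A = Q.T @ C @ Q                                    # the compression

lam_C = np.sort(np.linalg.eigvalsh(C))[::-1]       # descending
lam_A = np.sort(np.linalg.eigvalsh(A))[::-1]

# Consequence of Cauchy interlacing: lambda_j(C) >= lambda_j(A), j = 1..k
assert np.all(lam_C[:k] >= lam_A - 1e-12)
# Summing gives tr(Q^T C Q) <= lambda_1(C) + ... + lambda_k(C)
assert np.trace(A) <= lam_C[:k].sum() + 1e-12
```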

Solution 4:

A proof by the Schur-Horn theorem:

Let $V = [Q \ P]$ be an orthogonal matrix. Then $Q = V\left( \begin{array}{c} I_K \\ 0 \\ \end{array} \right)$. We have \begin{align} \mathrm{Tr}(Q^{\mathsf{T}}CQ) &= \mathrm{Tr}\left([I_K \ 0]V^{\mathsf{T}}CV\left( \begin{array}{c} I_K \\ 0 \\ \end{array} \right)\right)\\ &= \mathrm{Tr}\left(V^{\mathsf{T}}CV\left( \begin{array}{c} I_K \\ 0 \\ \end{array} \right)[I_K \ 0]\right)\tag{1}\\ &= \mathrm{Tr}\left(V^{\mathsf{T}}CV\left( \begin{array}{cc} I_K & 0 \\ 0 & 0 \\ \end{array} \right) \right)\\ &= \sum_{i=1}^K (V^\mathsf{T}CV)_{i,i}. \tag{2} \end{align} In (1), we have used the well-known fact that $\mathrm{Tr}(AB) = \mathrm{Tr}(BA)$ for $A \in \mathbb{R}^{m\times n}$ and $B \in \mathbb{R}^{n\times m}$.

Thus, we turn to find an orthogonal matrix $V$ such that $\sum_{i=1}^K (V^\mathsf{T}CV)_{i,i}$ is maximized.
Let $C = U\mathrm{diag}(\lambda_1, \lambda_2, \cdots, \lambda_d)U^\mathsf{T}$ be the eigendecomposition of $C$ where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$ are the eigenvalues of $C$ in descending order, and $U$ is an orthogonal matrix whose columns are the eigenvectors of $C$. Let $$G = V^\mathsf{T}CV = V^\mathsf{T}U\mathrm{diag}(\lambda_1, \lambda_2, \cdots, \lambda_d)U^\mathsf{T}V. \tag{3}$$ Clearly, $\lambda_1, \lambda_2, \cdots, \lambda_d$ are also the eigenvalues of $G$. Let $g = (G_{1,1}, G_{2,2}, \cdots, G_{d,d})$ and let $\lambda = (\lambda_1, \lambda_2, \cdots, \lambda_d)$. By the Schur-Horn theorem [1][2], we know that $g$ is majorized by $\lambda$, which results in $$\sum_{i=1}^K G_{i,i} \le \sum_{i=1}^K \lambda_i \tag{4}$$ with equality if $U^\mathsf{T}V = I_d$ (see (3)), i.e., $V = U$.

We conclude that the maximum of $\mathrm{Tr}(Q^{\mathsf{T}}CQ)$ is $\sum_{i=1}^K \lambda_i$ which is achieved at $Q = U\left( \begin{array}{c} I_K \\ 0 \\ \end{array} \right)$.
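The majorization step can be illustrated numerically: for a random orthogonal $V$, the sorted diagonal of $G=V^{\mathsf{T}}CV$ is majorized by the eigenvalues of $C$, and $V=U$ attains the bound (a sketch assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(5)
d, K = 6, 3
A = rng.standard_normal((d, d))
C = (A + A.T) / 2

lam, U = np.linalg.eigh(C)
lam, U = lam[::-1], U[:, ::-1]                    # descending eigenvalues

V = np.linalg.qr(rng.standard_normal((d, d)))[0]  # random orthogonal V
g = np.diag(V.T @ C @ V)                          # diagonal of G = V^T C V

# Schur-Horn: the diagonal of G is majorized by the eigenvalues of C
g_sorted = np.sort(g)[::-1]
assert np.all(np.cumsum(g_sorted) <= np.cumsum(lam) + 1e-12)
assert np.isclose(g.sum(), lam.sum())             # traces agree

# In particular sum of the first K diagonal entries is at most
# lambda_1 + ... + lambda_K, attained at V = U
assert np.isclose(np.diag(U.T @ C @ U)[:K].sum(), lam[:K].sum())
```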

Reference

[1] https://en.wikipedia.org/wiki/Schur%E2%80%93Horn_theorem

[2] https://mathworld.wolfram.com/HornsTheorem.html

Definition of majorization: Let $x, y \in \mathbb{R}^n$ be given. We say that $y$ is majorized by $x$ if and only if $$\sum_{i=1}^k x_{[i]} \ge \sum_{i=1}^k y_{[i]}, \ k=1, 2, \cdots, n-1$$ and $$\sum_{i=1}^n x_{[i]} = \sum_{i=1}^n y_{[i]}$$ where $x_{[1]} \ge x_{[2]} \ge \cdots \ge x_{[n]}$ denotes a decreasing rearrangement of $x_1, x_2, \cdots, x_n$.