Minimize ${tr(\mathbf{X}^T\mathbf{P}\mathbf{X})}/{tr(\mathbf{X}^T\mathbf{Q}\mathbf{X})}$ s.t. $\mathbf{X}^T\mathbf{X}=\mathbf{I}$
We have
$$C(\mathbf{X})=\frac{tr(\mathbf{X}^T\mathbf{P}\mathbf{X})}{tr(\mathbf{X}^T\mathbf{Q}\mathbf{X})}$$ where $\mathbf{X}$ is an $N\times M$ matrix of unknowns and $\mathbf{P}$ and $\mathbf{Q}$ are $N\times N$ positive definite constant matrices. We want to find the $\mathbf{X}$ that minimizes $C$ subject to $\mathbf{X}^T\mathbf{X}=\mathbf{I}$.
A solution for the case without the orthonormality constraint can be found here.
*A solution to the special case $N=3, M=2, Q=I$ would also be of interest.*
Some context:
To give some context and a geometric interpretation, assume we have $N_s$ points in an $N$-dimensional space, given by $u_1,\dots,u_{N_s}$, and each point has a label (i.e., is assigned to a class); for simplicity, assume there are two classes: red and blue. We want to find a mapping from the $N$-dimensional space to an $M$-dimensional space, sending $u_1,\dots,u_{N_s}$ to $v_1,\dots,v_{N_s}$ with $v_i = X^Tu_i$, such that in the $M$-dimensional space the points with the same color are close to each other relative to points with different colors. Using the above formulation, we take $P=A^TA$, where $A$ is the matrix whose rows are the pairwise differences $(u_i - u_j)^T$ over all pairs $i,j$ of points with the same label, and similarly $Q=B^TB$, where $B$ is the matrix of pairwise differences over all pairs of points with different labels.
This can also be viewed as a projection onto an $M$-dimensional space that retains inter-class variation but minimizes intra-class variation. We hope to use this as a general method for dimensionality reduction, with particular application to batch-effect removal and classification for genomic data.
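For concreteness, here is a minimal NumPy sketch of this construction (the function name `scatter_matrices` and the variable names are mine, not part of the problem statement). It accumulates the outer products of the pairwise difference vectors, which is the same as forming $A^TA$ and $B^TB$:

```python
import numpy as np
from itertools import combinations

def scatter_matrices(U, labels):
    """Return (P, Q): P from same-label pairs, Q from different-label pairs.

    U      : (N_s, N) array, one point u_i per row.
    labels : length-N_s sequence of class labels.
    """
    N = U.shape[1]
    P = np.zeros((N, N))
    Q = np.zeros((N, N))
    for i, j in combinations(range(len(U)), 2):
        d = (U[i] - U[j]).reshape(-1, 1)  # pairwise difference vector (u_i - u_j)
        if labels[i] == labels[j]:
            P += d @ d.T                  # same label: contributes to P = A^T A
        else:
            Q += d @ d.T                  # different labels: contributes to Q = B^T B
    return P, Q
```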
Here is a numerical example: We have 8 points in $N=3$ dimensional space given by their coordinates:
$$ \begin{bmatrix} \mathbf{x} & \mathbf{ y} & \mathbf{z} & \mathbf{label}\\ 18 & 5 & 17 & blue \\ 24 & 2 & 10 & blue \\ 12 & 13 & 1 & blue \\ 11 & 7 & 9 & blue \\ 20 & 4 & 22 & red \\ 8 & 15 & 14 & red \\ 21 & 23 & 3& red \\ 19 & 16 & 6& red \end{bmatrix} $$
Then for matrices $P$ and $Q$ we have:
$$ P=\begin{bmatrix} 875& -267& 79\\ -267& 999& -1033\\ 79& -1033& 1390 \end{bmatrix} $$
$$ Q=\begin{bmatrix} 884 & -174& 103\\ -174 &1960 &-785\\ 103 &-785& 1454 \end{bmatrix} $$
For the case $M=1$, the optimal $X$, given by the eigenvector corresponding to the smallest eigenvalue of $Q^{-1}P$, is $X=[0.198, 0.782, 0.591]^{T}$, and the transformed values are $v = X^T u = [17.53, 12.23, 13.13, 12.97, 20.10, 21.59, 23.91, 19.82]$.
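Assuming the `scatter_matrices` helper from the sketch above, the following snippet should reproduce $P$, $Q$ and, up to sign and rounding, the $M=1$ direction and projected values quoted here; SciPy's `eigh(P, Q)` solves the generalized problem $Px=\lambda Qx$ with eigenvalues in ascending order:

```python
import numpy as np
from scipy.linalg import eigh

U = np.array([[18,  5, 17], [24,  2, 10], [12, 13,  1], [11,  7,  9],
              [20,  4, 22], [ 8, 15, 14], [21, 23,  3], [19, 16,  6]], dtype=float)
labels = ["blue"] * 4 + ["red"] * 4

P, Q = scatter_matrices(U, labels)            # should match the matrices P and Q above
vals, vecs = eigh(P, Q)                       # generalized eigenproblem P x = lambda Q x
x = vecs[:, 0] / np.linalg.norm(vecs[:, 0])   # smallest eigenvalue; rescale to unit norm
print(x)                                      # ~ +/- [0.198, 0.782, 0.591]
print(U @ x)                                  # projected values v_i = X^T u_i
```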
The following is a plot of the data used in the above example:
Here are the projected points for the optimal $X$ obtained above:
So, for $M=2$, we want to find an orthonormal transformation matrix $X$ that maps the data points to a two-dimensional space (instead of the one-dimensional space in the above example). We want the red dots to be close to each other, and the blue dots to be close to each other and far from the red dots, in the transformed space.
$\textbf{Proposition 1.}$ If $X$ realizes a local extremum of $C$ under the condition $X^TX=I_M$, then necessarily
$(1)$ $(I_N-XX^T)(tr(X^TQX)PX-tr(X^TPX)QX)=0$.
$\textbf{Proof}$. Step 1. We determine the tangent space $T_X$ of the variety $X^TX-I=0$ at $X$. A matrix $H\in M_{N,M}$ lies in $T_X$ iff $H^TX+X^TH=0$, that is, $X^TH=K$ with $K$ a skew-symmetric $M\times M$ matrix. Since $X^TX=I$, $X^+=X^T$ and
$H=XK+(I_N-XX^T)W$ where $W\in M_{N,M}$ is arbitrary.
Step 2. If $X$ reaches a local extremum, then, for every $H\in T_X$, $DC_X(H)=0$.
$DC_X(H)=\dfrac{2}{(tr(X^TQX))^2}(tr(X^TQX)tr(H^TPX)-tr(X^TPX)tr(H^TQX))$.
Then, for every $K\in M_M$ skew symmetric and $W\in M_{N,M}$
$tr(X^TQX)tr(-KX^TPX+W^T(I-XX^T)PX)-tr(X^TPX)tr(-KX^TQX+W^T(I-XX^T)QX)=0$.
Clearly $tr(KX^TPX)=tr(KX^TQX)=0$, since the trace of the product of a skew-symmetric matrix with a symmetric one vanishes. Since what remains must be $0$ for every $W$, we deduce relation $(1)$. $\square$
$\textbf{Remarks.}$ i) Relation $(1)$ is complicated because it has degree $5$ with respect to the unknowns $(x_{i,j})$.
ii) $(1)$ says that each of the $M$ columns of $tr(X^TQX)PX-tr(X^TPX)QX$ is in $\ker(I-XX^T)$, a vector space of dimension $M$.
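As a quick numerical sanity check (reusing `P`, `Q` and the unit-norm $M=1$ minimizer `x` from the sketches above), the left-hand side of relation $(1)$ indeed vanishes at that solution:

```python
import numpy as np

def relation_1_residual(X, P, Q):
    # tr(X^T Q X) P X - tr(X^T P X) Q X, projected onto the complement of range(X)
    G = np.trace(X.T @ Q @ X) * (P @ X) - np.trace(X.T @ P @ X) * (Q @ X)
    return (np.eye(P.shape[0]) - X @ X.T) @ G

X1 = x.reshape(-1, 1)                                 # the M = 1 solution as a 3x1 matrix
print(np.abs(relation_1_residual(X1, P, Q)).max())    # essentially zero (round-off only)
```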
EDIT 1. Particular solutions are those of $tr(X^TQX)PX-tr(X^TPX)QX=0$. Consider the case $M=2$. Then $X=[e_1,e_2]$ is orthonormal and $PX=\alpha QX$ ($\alpha >0$), i.e. $(Q^{-1}P)X=\alpha X$, i.e. $Q^{-1}Pe_1=\alpha e_1,\;Q^{-1}Pe_2=\alpha e_2$: $e_1,e_2$ are eigenvectors associated with the same eigenvalue $\alpha$ of $Q^{-1}P$. Such a solution exists only if $\alpha$ is a repeated eigenvalue; I think that, if the smallest eigenvalue of $Q^{-1}P$ is repeated, then such a solution constitutes a good candidate for the minimum.
When $N=3,M=2$, the set of solutions of the system $\{(1),\,X^TX=I_M\}$ (for a random choice of $P,Q$) is infinite and depends on one parameter. Thus, when $M\geq 2$, the problem is no longer convex.
iii) We may assume that one of $P,Q$ is a positive diagonal matrix.
EDIT 2. About the OP's toy example. For $M=2$, the required minimum is $0.51786$ (using the NLPSolve of Maple).
Same example. We search for the minimum among $\{X=[u,y]:y\in \mathbb{R}^N\}$, where $u$ is the eigenvector associated with the smallest eigenvalue of $Q^{-1}P$ and $y$ is an unknown vector such that $\{u,y\}$ is orthonormal. We obtain $0.52865$, which is not far from the true minimum.
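For readers without Maple, here is a rough equivalent of NLPSolve (a sketch of mine using SciPy's SLSQP, reusing `P`, `Q` from the toy example): minimize $C$ over the entries of $X$ with the orthonormality equations as constraints. With a few random restarts it should land near the $0.51786$ quoted above, although, as with any local solver, one cannot be absolutely sure a run has not fallen into a hole:

```python
import numpy as np
from scipy.optimize import minimize

N, M = 3, 2

def cost(xflat, P, Q):
    X = xflat.reshape(N, M)
    return np.trace(X.T @ P @ X) / np.trace(X.T @ Q @ X)

def orth_residual(xflat):
    X = xflat.reshape(N, M)
    return (X.T @ X - np.eye(M))[np.triu_indices(M)]   # X^T X - I must vanish

rng = np.random.default_rng(0)
best = None
for _ in range(20):                                    # random restarts against local minima
    X0 = np.linalg.qr(rng.standard_normal((N, M)))[0]  # random orthonormal starting point
    res = minimize(cost, X0.ravel(), args=(P, Q), method="SLSQP",
                   constraints={"type": "eq", "fun": orth_residual})
    if best is None or res.fun < best.fun:
        best = res
print(best.fun)                                        # expected to be close to 0.51786
```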
EDIT 3. I think that there is no closed form solution even when $M=2$.
I don't know which values of $M, N$ interest you. NLPSolve does the job for $N=15,M=3$ (with $30$ digits) in 31". Of course, one isn't absolutely sure of the result provided by Maple; when iterating, the current point may have fallen into a hole! Still, that hardly matters when your points of the same color are well separated; I think this happens even if you do not attain exactly the minimum of $C$. For example (cf. above), when $N$ is large and $M$ is small ($M=2$), you can choose the first column of $X$ as an eigenvector associated with the smallest eigenvalue of $Q^{-1}P$; then the number of unknowns is only $N$ (for the second column of $X$).
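This last suggestion (first column fixed, only the second column unknown) is easy to prototype as well; here is a sketch of mine, again reusing `P`, `Q` and the eigenvector `x` computed earlier, which should come out near the $0.52865$ of EDIT 2:

```python
import numpy as np
from scipy.optimize import minimize

u = x                                   # first column: unit-norm eigenvector for the
                                        # smallest eigenvalue of Q^{-1} P

def cost_fixed_u(y):
    X = np.column_stack([u, y])
    return np.trace(X.T @ P @ X) / np.trace(X.T @ Q @ X)

cons = [{"type": "eq", "fun": lambda y: y @ y - 1.0},   # ||y|| = 1
        {"type": "eq", "fun": lambda y: u @ y}]         # y orthogonal to u

rng = np.random.default_rng(1)
y0 = rng.standard_normal(len(u))
y0 -= (u @ y0) * u                      # project the start onto u-perp and normalize
y0 /= np.linalg.norm(y0)
res = minimize(cost_fixed_u, y0, method="SLSQP", constraints=cons)
print(res.fun)                          # expected to be close to 0.52865
```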
EDIT 4BIS. We can replace the condition $X^TX=I_M$ with $X^TQX=I_M$; then it remains to find
$inf(tr(X^TPX))$ under the condition $X^TQX=I$.
According to the OP's good idea, we decompose (Cholesky) $Q=R^TR$ and put $X'=RX$ and $S=R^{-T}PR^{-1}>0$. Then the problem is to find
$inf(tr(X'^TSX'))$, under the condition $X'^TX'=I$.
Now we use Remark iii). We decompose $S=U^TDU$ (where $U$ is orthogonal and $D$ is positive diagonal; we write $U$ rather than $P$ to avoid a clash with the matrix $P$ of the problem) and put $Y=UX'$. Then the problem becomes
to find $L=inf(tr(Y^TDY))$, under the condition $Y^TY=I$. We may assume that $D=diag(\lambda_1,\cdots,\lambda_N)$ where $0\leq \lambda_1\leq\cdots\leq \lambda_N$.
$\textbf{Proposition 2}$ The lower bound $L$ is $\sum_{i=1}^M \lambda_i$ and is reached by $Y=[e_1,\cdots,e_M]$ whose columns are the first $M$ vectors of the canonical basis.
$\textbf{Proof}$. If $Y=[e_1,\cdots,e_M]=[I_M,0]^T$, then $Y^TDY=[I_M,0] diag(diag(\lambda_1,\cdots,\lambda_M),D')[I_M,0]^T=diag(\lambda_1,\cdots,\lambda_M)$ and $tr(Y^TDY)=\sum_{i=1}^M \lambda_i$.
It remains to show that, if $Y^TY=I$, then $tr(Y^TDY)\geq \sum_{i=1}^M \lambda_i$. We prove this for $M=2,N=3$ (it is easy to generalize the result).
If $Y=\begin{pmatrix}a&d\\b&e\\c&f\end{pmatrix}$ and $D=diag(p,q,r)$ with $p\leq q\leq r$, then, using $a^2+b^2+c^2=d^2+e^2+f^2=1$,
$tr(Y^TDY)=p(a^2+d^2)+q(b^2+e^2)+r(c^2+f^2)=p(a^2+d^2)+q(b^2+e^2)+r(1-a^2-d^2)+r(1-b^2-e^2)\geq p+q,$
the last inequality holding because $r\geq p$, $r\geq q$ and, the columns of $Y$ being orthonormal, the rows of $Y$ have squared norm at most $1$ (they are diagonal entries of the projection $YY^T$), so $a^2+d^2\leq 1$ and $b^2+e^2\leq 1$. $\square$
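Numerically, the eigenvalues of $S=R^{-T}PR^{-1}$ are exactly the generalized eigenvalues of the pencil $(P,Q)$ (since $R^{-1}SR=Q^{-1}P$), so this variant of the problem has a closed-form answer: $L$ is the sum of the $M$ smallest generalized eigenvalues, attained by the corresponding generalized eigenvectors. A short sketch of mine, reusing `P`, `Q` from the toy example:

```python
import numpy as np
from scipy.linalg import eigh

M = 2
vals, vecs = eigh(P, Q)        # generalized eigenvalues ascending, Q-orthonormal vectors
L = vals[:M].sum()             # Proposition 2 applied to S: the infimum
X_opt = vecs[:, :M]            # satisfies X_opt^T Q X_opt = I_M by construction
print(L)
print(X_opt.T @ Q @ X_opt)     # ~ identity
```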