Calculating the trace of the product of two matrices

First, let me offer a purely mathematical derivation; then we will attempt to address the storage problem once I get answers to the questions I posed in the comments above. I will edit this answer as needed.

Since $A$ is symmetric and positive definite, it admits a Cholesky factorization $A=LL^T$, where $L$ is lower triangular; and $A^{-1}=L^{-T}L^{-1}$. Let us define $M=L^{-1}$, which is also a lower triangular matrix, so $A^{-1}=M^TM$; and let $m_k$ denote the $k$th column of $M$.

Furthermore, you say that $B$ is symmetric with two non-zero elements. This means that $B$ can take one of two forms: $$B=\alpha(e_ie_j^T+e_je_i^T) \quad \text{or}\quad B=\alpha e_ie_i^T + \beta e_je_j^T$$ where $e_k$ is a vector with a one in the $k$th position and zeros elsewhere. Let's consider the first form for a moment: $$\begin{aligned} \mathop{\textrm{Tr}}(A^{-1}B)&=\alpha\mathop{\textrm{Tr}}(A^{-1}(e_ie_j^T+e_je_i^T))\\ &=\alpha\mathop{\textrm{Tr}}(A^{-1}e_ie_j^T)+\alpha\mathop{\textrm{Tr}}(A^{-1}e_je_i^T) \\ &=\alpha e_j^TA^{-1}e_i+\alpha e_i^TA^{-1}e_j = 2\alpha\left[A^{-1}\right]_{ij} \\ &= 2\alpha e_j^TM^TMe_i = 2\alpha \langle m_i,m_j \rangle \end{aligned} $$ So as you can see, the trace requires exactly one element of $A^{-1}$, or equivalently the inner product of two columns of $M$. A similar derivation for the second case yields $$\mathop{\textrm{Tr}}(A^{-1}B)=\alpha\left[A^{-1}\right]_{ii}+\beta\left[A^{-1}\right]_{jj}=\alpha\langle m_i,m_i\rangle+\beta\langle m_j,m_j\rangle$$
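For concreteness, here is a minimal numerical check of the first form in Python with NumPy/SciPy. The random SPD matrix, $\alpha$, and the indices $i,j$ below are stand-ins for your actual data; the point is that the columns $m_i=L^{-1}e_i$ come from cheap triangular solves, so $A^{-1}$ is never formed explicitly.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(0)

# Synthetic SPD matrix standing in for the real A.
n = 6
G = rng.standard_normal((n, n))
A = G @ G.T + n * np.eye(n)

# First form: B = alpha * (e_i e_j^T + e_j e_i^T).
alpha, i, j = 2.5, 1, 4
e_i, e_j = np.eye(n)[i], np.eye(n)[j]

# A = L L^T; the columns of M = L^{-1} come from triangular solves,
# so A^{-1} is never formed explicitly.
L = cholesky(A, lower=True)
m_i = solve_triangular(L, e_i, lower=True)   # m_i = M e_i
m_j = solve_triangular(L, e_j, lower=True)   # m_j = M e_j

trace_fast = 2 * alpha * (m_i @ m_j)         # 2*alpha*<m_i, m_j>

# Brute-force check against the definition.
B = alpha * (np.outer(e_i, e_j) + np.outer(e_j, e_i))
assert np.isclose(trace_fast, np.trace(np.linalg.inv(A) @ B))
```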

So hopefully now it is clear why I asked: how many $B$ matrices are there? How is $A$ stored? What kinds of operations can we perform with $A$? Those questions are essential for determining what to do in this case. For instance, if there are only a handful of unique indices $i,j$ above, then one approach is to compute each $f_i\triangleq A^{-1}e_i$ using some sort of iterative method, then use $e_j^TA^{-1}e_i=e_j^Tf_i$.
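As a sketch of that iterative approach, assuming $A$ is available as a sparse SPD matrix (the matrix below is a synthetic stand-in), conjugate gradients gives $f_i = A^{-1}e_i$ without factoring or inverting $A$:

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import cg

rng = np.random.default_rng(1)

# Sparse SPD stand-in for A; the real A would come from your problem.
n = 10_000
S = sparse.random(n, n, density=1e-3, random_state=rng, format="csr")
A = (S @ S.T + 10.0 * sparse.identity(n)).tocsr()

i, j = 42, 137
e_i = np.zeros(n)
e_i[i] = 1.0

# A is SPD, so conjugate gradients applies to A f_i = e_i.
f_i, info = cg(A, e_i)
assert info == 0                 # 0 means the iteration converged

# [A^{-1}]_{ij} = e_j^T f_i is just the j-th entry of the solution.
A_inv_ij = f_i[j]
print(A_inv_ij)
```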

But if most of the indices $i=1,2,\dots,10000$ are represented, it may be more expedient to do some sort of Cholesky factorization of the matrix. Yes, you may not have enough memory to do an in-core factorization, but Cholesky factorizations can be done out-of-core: the computation proceeds in blocks, reading into memory only the data needed for the current block and writing each finished block back to disk before proceeding to the next.
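To illustrate the blocking pattern, here is a minimal in-core sketch of a right-looking blocked Cholesky in NumPy; in a genuine out-of-core code, each block access below would instead be a read from or write to disk:

```python
import numpy as np

def blocked_cholesky(A, bs=256):
    """Right-looking blocked Cholesky: overwrites the lower triangle of A
    with L such that A = L L^T.  In an out-of-core version, each block
    accessed below would be read from disk, updated, and written back,
    so only O(bs^2)-sized pieces need live in memory at once."""
    n = A.shape[0]
    for k in range(0, n, bs):
        e = min(k + bs, n)
        # 1. Factor the diagonal block (small enough to do in-core).
        A[k:e, k:e] = np.linalg.cholesky(A[k:e, k:e])
        if e < n:
            L11 = A[k:e, k:e]
            # 2. Panel solve: L21 = A21 * L11^{-T}.
            A[e:, k:e] = np.linalg.solve(L11, A[e:, k:e].T).T
            # 3. Trailing update: A22 -= L21 L21^T (done block column by
            #    block column when the trailing matrix lives on disk).
            A[e:, e:] -= A[e:, k:e] @ A[e:, k:e].T
    return np.tril(A)

# Quick check on a synthetic SPD matrix.
rng = np.random.default_rng(2)
n = 500
G = rng.standard_normal((n, n))
A = G @ G.T + n * np.eye(n)
assert np.allclose(blocked_cholesky(A.copy(), bs=128), np.linalg.cholesky(A))
```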


If there really are only 2 non-zero values in $B$, then you can compute $tr(A^{-1}B)$ from $A$'s determinant and 2 of its cofactors (signed minors). A matrix with two non-zero entries is the sum of two matrices with one non-zero entry each, and a matrix with one non-zero entry is a scaled outer product of standard basis vectors. In bra-ket notation: $$ B = X + Y = \left| r_1 \right> x \left< c_1 \right| + \left| r_2 \right> y \left< c_2 \right| $$

Since trace is a linear operator, $$ tr(A^{-1}B) = tr(A^{-1}(X+Y)) = tr(A^{-1}X) + tr(A^{-1}Y) $$

Let $C$ be the matrix of cofactors of $A$. Then $$ tr(A^{-1}X) = tr( A^{-1} \left| r_1 \right> x \left< c_1 \right| ) = x \left< c_1 \right| A^{-1} \left| r_1 \right> = x (A^{-1})_{c_1r_1} = x \left(\frac{C^{\top}}{\det A}\right)_{c_1r_1} = x \frac {C_{r_1c_1}}{\det A} \\ \therefore tr(A^{-1}B) = \frac {x\,C_{r_1c_1} + y\,C_{r_2c_2}}{\det A} $$

This saves having to invert $A$: only the determinant and 2 specific cofactors of $A$ need to be computed, so $tr(A^{-1}B)$ can be computed within a small constant factor of the cost of $\det A$.
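Here is a small sketch of this formula in Python; the matrix, indices, and values are placeholders, and the `cofactor` helper naively re-runs a dense determinant just to check the algebra (the references below discuss practical determinant algorithms for large sparse $A$):

```python
import numpy as np

def cofactor(A, r, c):
    """Signed minor C_{rc}: delete row r and column c, take the determinant."""
    minor = np.delete(np.delete(A, r, axis=0), c, axis=1)
    return (-1) ** (r + c) * np.linalg.det(minor)

rng = np.random.default_rng(3)
n = 5
A = rng.standard_normal((n, n)) + n * np.eye(n)   # any invertible A works

# B has exactly two non-zero entries: x at (r1, c1) and y at (r2, c2).
x, r1, c1 = 3.0, 0, 2
y, r2, c2 = -1.5, 3, 3

trace = (x * cofactor(A, r1, c1) + y * cofactor(A, r2, c2)) / np.linalg.det(A)

# Brute-force check against the definition.
B = np.zeros((n, n))
B[r1, c1], B[r2, c2] = x, y
assert np.isclose(trace, np.trace(np.linalg.inv(A) @ B))
```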

There has been progress in recent years on practical algorithms for determinants of large sparse matrices. This is not my area of expertise but here are some references:

  • Erlend Aune, Daniel P. Simpson: Parameter estimation in high dimensional Gaussian distributions, particularly section 2.1 (arXiv:1105.5256); a longer, published version is also available
  • Ilse C. F. Ipsen, Dean J. Lee: Determinant Approximations (arXiv:1105.0437)
  • Arnold Reusken: Approximation of the determinant of large sparse symmetric positive definite matrices (arXiv:hep-lat/0008007)
  • Notes for an implementation in the Shogun library

These seem to be primarily approximation methods that can compute the determinant to arbitrary accuracy at the cost of increased running time, letting you choose the balance between speed and accuracy. They also seem to avoid materializing large dense matrices in intermediate calculations.
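The papers above use different machinery, but as a generic illustration of that speed/accuracy trade-off, here is a minimal sketch of a stochastic estimator based on $\log\det A = \mathop{\textrm{Tr}}(\log A)$ with Hutchinson probing. For clarity it forms $\log A$ densely with `scipy.linalg.logm`; the referenced methods instead approximate the products $\log(A)\,z$ matrix-free with Krylov or Chebyshev techniques.

```python
import numpy as np
from scipy.linalg import logm

rng = np.random.default_rng(4)
n = 200
G = rng.standard_normal((n, n))
A = G @ G.T + n * np.eye(n)          # SPD, so log det A is real

# log det A = Tr(log A); estimate the trace with Hutchinson probing:
# Tr(M) ~ (1/s) * sum_k z_k^T M z_k for random +/-1 vectors z_k.
logA = np.real(logm(A))              # dense here only for illustration
exact = np.linalg.slogdet(A)[1]
for s in (5, 50, 500):               # more probes: slower but more accurate
    Z = rng.choice([-1.0, 1.0], size=(n, s))
    estimate = np.einsum("ij,ij->", Z, logA @ Z) / s
    print(f"{s:3d} probes: {estimate:10.3f}   (exact {exact:10.3f})")
```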