Multivariate normal and multivariate Bernoulli

Solution 1:

Regarding the second part of the question: say ${\bf X} = (X_1, X_2, \dots, X_d)$ is a $d$-dimensional Bernoulli random variable. The joint probability function takes, in general, $2^d$ values, and the condition $\sum p({\bf x})=1$ leaves us with $2^d-1$ degrees of freedom. To represent it in a compact and handy way, the following representation is possible (I devised it some years ago for a problem I had at hand; I guess it's well known).

I assume that our Bernoulli takes values in $\{-1,1\}$ instead of the more usual $\{0, 1\}$. Then, the joint probability can be written as

$$P_{\bf X}(x_1,x_2,\cdots, x_d)=\frac{1}{2^d} \sum_{S_\lambda} a_{S_\lambda} \prod_{k \in S_\lambda} x_k$$

where $S_\lambda$ are all the subsets of the set $\{1, 2, \cdots ,d\}$ and $a_{S_\lambda} = E[\prod_{k \in S_\lambda} X_k]$ ($a_{\emptyset}=1$).

For example, for $d=3$:

$$P_{\bf X}(x_1,x_2,x_3)=\frac{1}{2^3}\left( a_{123}\, x_1 x_2 x_3 + a_{12}\, x_1 x_2 + a_{23}\, x_2 x_3 + a_{13}\, x_1 x_3 + a_{1}\, x_1 + a_{2}\, x_2 + a_{3}\, x_3 + 1 \right)$$

where $a_{123}=E[X_1 X_2 X_3]$, $a_{12}=E[X_1 X_2]$, $a_{2}=E[X_2]$, etc.
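To make the $d=3$ case concrete, here is a minimal Python sketch (the coefficient values are made up for illustration, not derived from anything above) that evaluates the expansion at all $2^3$ sign patterns and checks that the probabilities sum to 1:

```python
import itertools

# Hypothetical moment coefficients a_S = E[prod_{k in S} X_k] for d = 3.
# a[frozenset()] = 1 always; the other values are made up for illustration.
a = {
    frozenset(): 1.0,
    frozenset({1}): 0.2, frozenset({2}): -0.1, frozenset({3}): 0.3,
    frozenset({1, 2}): 0.05, frozenset({1, 3}): 0.1, frozenset({2, 3}): -0.02,
    frozenset({1, 2, 3}): 0.01,
}

def pmf(x):
    """Evaluate P(X1=x1, X2=x2, X3=x3) for x in {-1, +1}^3 via the expansion."""
    total = 0.0
    for S, a_S in a.items():
        prod = 1
        for k in S:
            prod *= x[k - 1]          # x is 0-indexed, components are 1-indexed
        total += a_S * prod
    return total / 2 ** 3

probs = {x: pmf(x) for x in itertools.product([-1, 1], repeat=3)}
print(sum(probs.values()))                   # ~1.0: the x-dependent terms cancel in the sum
print(all(p >= 0 for p in probs.values()))   # must also hold for valid coefficients
```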

This representation has two advantages. First, the coefficients do not change if we add or remove (marginalize) components: if we have the above formula and want to compute $P(x_1,x_2)$ or $P(x_1,x_2,x_3,x_4)$, we only have to remove or add terms (and change the normalization factor). Second, independent components (recall that, for a pair of Bernoulli variables, independent is the same as uncorrelated) are easily spotted: for example, if $X_2$ is independent of $X_1$, then $a_{12}=a_{1} a_2$. Further, $\operatorname{cov}(X_1,X_2)=a_{12} - a_1 a_2$.
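Both properties are easy to see in code; this sketch reuses the hypothetical dictionary `a` from the previous one:

```python
# Marginal P(X1=x1, X2=x2): drop every subset containing 3 and renormalize by 1/2^2.
def marginal_12(x1, x2, a):
    keep = {S: a_S for S, a_S in a.items() if 3 not in S}
    return sum(a_S * (x1 if 1 in S else 1) * (x2 if 2 in S else 1)
               for S, a_S in keep.items()) / 2 ** 2

print(sum(marginal_12(x1, x2, a) for x1 in (-1, 1) for x2 in (-1, 1)))  # ~1.0

# Covariance and the independence check read off directly from the coefficients:
cov_12 = a[frozenset({1, 2})] - a[frozenset({1})] * a[frozenset({2})]
print(cov_12)  # zero exactly when X1 and X2 are uncorrelated (= independent here)
```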

A disadvantage of this representation is that the condition $|a_{i \dots k}|\le 1$ is necessary but not sufficient to guarantee that the expansion corresponds to a valid probability function. To check that, we must recover the values of the probability function. But the relation between ${\bf p} =(p_{\emptyset},p_1,p_2, \dots, p_{12}, \dots, p_{123})$ and the coefficients ${\bf a} =(a_{\emptyset},a_1,a_2, \dots, a_{12}, \dots, a_{123})$ is straightforward: ${\bf a} = {\bf M}\, {\bf p}$ with $m_{i,j}=\pm 1$ (the sign depends on the parity of the number of elements of the row's subset that are missing from the column's subset; see the addendum below), and ${\bf M}^{-1}= {\bf M}^t / 2^d$ (I leave some details out; I doubt they will be missed).

So, if we are given the first and second moments of ${\bf X}$, we automatically have the first- and second-order coefficients. The rest of the coefficients are arbitrary, except for the restriction that they must lead to a valid probability function ($0 \le p_{i \dots k} \le 1$).
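A hedged sketch of that check (the moment values below are made up): fix the first- and second-order coefficients, choose the remaining third-order one, recover the $2^d$ probabilities by evaluating the expansion at every sign pattern, and test that none is negative.

```python
import itertools
import math

d = 3
# Hypothetical given moments E[X_i] and E[X_i X_j], plus one free
# third-order coefficient a_{123} that we choose ourselves.
a = {frozenset(): 1.0,
     frozenset({1}): 0.5, frozenset({2}): 0.5, frozenset({3}): 0.0,
     frozenset({1, 2}): 0.4, frozenset({1, 3}): 0.0, frozenset({2, 3}): 0.0,
     frozenset({1, 2, 3}): 0.9}   # too large: these moments require |a_123| <= 0.4

def prob(x):
    # p(x) = 2^{-d} * sum_S a_S * prod_{k in S} x_k
    return sum(a_S * math.prod(x[k - 1] for k in S) for S, a_S in a.items()) / 2 ** d

ps = [prob(x) for x in itertools.product([-1, 1], repeat=d)]
print(all(p >= 0 for p in ps))   # False for a_123 = 0.9; True for, e.g., a_123 = 0.3
```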

Regarding the first question, the idea is interesting, but I doubt there is some "maximum" distance. A mere suggestion: given that we are restricting ourselves to a variable with fixed first and second moments, to compute its deviation from the corresponding multivariate Gaussian I'd try the entropy (subtracting it from the entropy of the corresponding Gaussian, which should be larger).

Added: To get ${\bf M}$ explicitly: note that in my notation for the joint probability function ${\bf p} =(p_{\emptyset},p_1,p_2, \dots, p_{12}, \dots, p_{123})$ the subscripts denote the components that take the value $+1$; hence, for example, $p_{1}=P(X_1=1,X_2=-1,X_3=-1)$, $p_{23}=P(X_1=-1,X_2=1,X_3=1)$, etc.

Say we want to compute $a_{12}$.

$$a_{12}=E(X_1 X_2) = p_{123} (1)(1) + p_{12} (1)(1) + p_{23} (-1)(1) + \cdots + p_1 (1)(-1) + p_{\emptyset}(-1)(-1)$$

Or

$$a_i = \sum_j (-1)^{\#(S_i \setminus S_j)} p_j$$

where the indexes $i,j$ run over the $2^d$ subsets, and $\#(S_i \setminus S_j)$ is the cardinality of the set difference $S_i \setminus S_j$. That term, then, gives the elements of the matrix ${\bf M}$.
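A small numerical sketch of this construction, with the subsets listed by size (any fixed ordering of my choosing would do): it builds ${\bf M}$ from the formula above and checks ${\bf a} = {\bf M p}$ and ${\bf M}^{-1} = {\bf M}^t / 2^d$ on a made-up probability vector.

```python
import numpy as np
from itertools import combinations

d = 3
# All 2^d subsets of {1, ..., d}; this particular ordering is my own choice.
subsets = [frozenset(c) for r in range(d + 1)
           for c in combinations(range(1, d + 1), r)]

# m_{i,j} = (-1)^{#(S_i \ S_j)}, as in the formula above.
M = np.array([[(-1) ** len(Si - Sj) for Sj in subsets] for Si in subsets])

# Sanity checks: a = M p for a made-up p, and M^{-1} = M^t / 2^d.
rng = np.random.default_rng(0)
p = rng.random(2 ** d)
p /= p.sum()                       # an arbitrary valid probability vector
a = M @ p                          # moment coefficients
print(np.allclose(M.T @ a / 2 ** d, p))                    # True: recovers p
print(np.array_equal(M @ M.T, 2 ** d * np.eye(2 ** d)))    # True
```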

Update: The matrix ${\bf M}$ is a type of Hadamard matrix, specifically (if I'm not mistaken) a Walsh matrix.