Probability of distribution of intersections between two binary arrays

Have two arrays $\vec{x}$ and $\vec{y}$, both of length $N$. They are binary (filled with 1's and 0's). We know that

$\sum_i x_i = N_x$ and

$\sum_i y_i = N_y$

Let $perm(\vec{x})$ denote a random permutation of the elements of an array. Thus define

$\vec{x}' = perm(\vec{x})$ and

$\vec{y}' = perm(\vec{y})$

I am interested in finding an analytic expression for the probability $P[C = c]$ of the number of randomly-intersecting elements, namely

$C = \sum_i x_i' y_i'$

If exact expression does not have a closed form, a good approximation would also be helpful.

The origin of this problem is comes from optics. I have two multichannel recordings before and after I do something. I want to test whether the number of channels co-active in both situations can be explained by the null hypothesis that the exact channels active at every moment in time are completely random.

My Attempt No 1:

The problem can be reformulated as follows: Assume there are two urns:

  • Urn $X$ has $N_x$ white and $N-N_x$ black balls
  • Urn $Y$ has $N_y$ white and $N-N_y$ black balls.

We draw one ball from each urn without replacement, and check if both balls are white. Then repeat until all balls are drawn. We are interested in the probability that we will draw a pair of white balls exactly $C$ times.

Now, if we relax the problem and allow for draws with replacement, it is easy to see that $P[C=c] \sim Bin(c, N, p)$ is a binomial distribution with $p=\frac{N_x}{N} \cdot \frac{N_y}{N}$. Since the original problem requires us to draw without replacement, it seems that the answer might be some form of a hypergeometric distribution. However, original hypergeometric distribution deals with only 1 urn. I need an extension that deals with matching 2 urns.


Solution 1:

Inspired by some other question and answers about the probability of a certain Hamming distance, for example here, I found the following formula:

$$P(N,N_x,N_y,c) = \frac{{N_y \choose c}{N - Ny \choose Nx - c}{N \choose Ny}}{{N \choose Nx}{N \choose Ny}} = \frac{{N_x \choose c}{N - Nx \choose Ny - c}{N \choose Nx}}{{N \choose Nx}{N \choose Ny}}$$

assuming ${n \choose k} = 0$ when $n \lt k$.

Note that if $N - Ny \ge N_x - c$ then $N - N_x \ge N_y - c$ and vice versa.

I tried it numerically and it holds for all cases with $N \le 10$.

The denominator is the number of all couple of arrays.

To build the numerator, we can think of choosing a couple $(\vec{x},\vec{y})$ satisfying the intersection requirement, then ${N_y \choose c}$ are all the ways that the $\vec{y}$ ones can be used to form the intersection, while ${N-N_y \choose N_x-c}$ are all the ways that the $\vec{y}$ zeroes can be assigned to the remaining $N_x-c$ ones of $\vec{x}$, all that multiplied by ${N \choose N_y}$, the number of $\vec{y}$ arrays. Okay, maybe somebody can help to justify that better!

EDIT: additional explanation using generating functions.

We can apply generating functions to obtain the above formula, in the way explained in this answer.

Suppose we choose just one $\vec{y}$ and we may assume $y_i=1$ for $i=1, \dots, N_y$ (the order is not important here). We have a system of two equations:

$$\begin{cases} x_1 + \ldots + x_{N_y} = c \\ x_1 + \ldots + x_N = N_x \\ \end{cases} $$

The coefficients of the first equation are $a_{1i}=y_i$ ($i=1,\dots,N$) and those of the second equation are $a_{2i}=1$ ($i=1,\dots,N$). The generating function is:

$$g(z_1,z_2)=\prod_{i=1}^{N}{\left(1+\prod_{j=1}^{2}{z_j^{a_{ji}}}\right)}=\left(1+z_1z_2\right)^{N_y}\left(1+z_2\right)^{N-N_y}=\left[\sum_{i=0}^{N_y}{{N_y \choose i}z_1^iz_2^i}\right]\left[\sum_{j=0}^{N-N_y}{{N-N_y \choose j}z_2^j}\right]=\sum_{i=0}^{N_y}{\sum_{j=0}^{N-N_y}{{N_y \choose i}{N-N_y \choose j}z_1^iz_2^{i+j}}}$$

and we need to get the coefficient of the term with $i = c$ and $i+j=N_x$ (and thus $j = N_x - c$) which is:

$$[z_1^{c}z_2^{N_x}]g(z_1,z_2)={N_y \choose c}{N-N_y \choose N_x-c}$$

I just saw now that the result is exactly the probability mass function of the hypergeometric distribution that you considered.