Probability $k$ bins are non-empty

Solution 1:

(Note: The analysis in this answer is for all $k$ bins we query being distinct. This is not what we care about in the case of Bloom filters, so I've left another answer with the right analysis. The stuff in this post is still useful, so I'm not deleting this answer yet.)

To avoid the incorrect independence assumption (though it turns out to be a good approximation), we can do the analysis in the following way.

Let $E_i$, for $1 \le i \le k$, be the event that the $i$th bin (out of the $k$ bins we're interested in) is empty after the $N = nk$ balls have been inserted. For $E_i$ to happen, each of the $N$ balls must have gone to one of the $m - 1$ bins other than bin $i$. So $$\Pr(E_i) = \left(1 - \frac1m\right)^N.$$ What about $\Pr(E_i \cap E_j)$? For both those bins to be empty, all the $N$ balls must have gone to the $m - 2$ bins other than bin $i$ and bin $j$, so $$\Pr(E_i \cap E_j) = \left(1 - \frac2m\right)^N.$$ Similarly, the intersection of any $r$ of the $E_i$s has probability $\left(1 - \frac{r}m\right)^N$.

Finally, by the inclusion-exclusion principle, we have

$$\begin{align} &\phantom{=}\Pr(\text{all $k$ bins are non-empty}) \\ &= 1 - \Pr(\text{at least one bin is empty}) \\ &= 1 - \Pr\left(\bigcup_{i=1}^k E_i\right) \\ &= 1 - \sum_{i}\Pr(E_i) + \sum_{i<j}\Pr(E_i \cap E_j) - \sum_{i<j<l}\Pr(E_i \cap E_j \cap E_l) + \dots \\ &= 1 - k\left(1 - \frac1m\right)^N + \binom{k}{2}\left(1 - \frac2m\right)^N - \binom{k}{3}\left(1 - \frac3m\right)^N + \dots \\ &= \sum_{r = 0}^k \binom{k}{r}(-1)^r\left(1 - \frac{r}m\right)^N \end{align}$$ as the exact probability.


From this we can see that $\displaystyle (1 - e^{-N/m})^k$ is indeed a good approximation (optimizing this gives $k = \frac{m}{n}\ln 2$ as the best value of $k$), as it's equal to (by the binomial theorem; compare with the exact expression above): $$\sum_{r = 0}^k \binom{k}{r}(-1)^r e^{-Nr/m}.$$
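As a quick numeric sanity check (a sketch in Python; the parameter values $m$, $n$, $k$ below are arbitrary illustrations, not from the question), we can compare the exact inclusion-exclusion sum with the approximation $(1 - e^{-N/m})^k$:

```python
from math import comb, exp

def exact_prob(m, n, k):
    """Exact Pr(all k distinct bins are non-empty) after N = n*k balls,
    via the inclusion-exclusion sum derived above."""
    N = n * k
    return sum((-1)**r * comb(k, r) * (1 - r / m)**N for r in range(k + 1))

def approx_prob(m, n, k):
    """The approximation (1 - e^{-N/m})^k, with N = n*k."""
    N = n * k
    return (1 - exp(-N / m)) ** k

m, n = 1000, 100          # illustrative values: m bins, n items, k hash functions
for k in (3, 5, 7, 10):   # the optimum of the approximation is near (m/n) ln 2 ≈ 6.93
    print(k, exact_prob(m, n, k), approx_prob(m, n, k))
```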

The two are close, term-for-term, because for large $N$, $\left(1 + \frac{x}{N}\right)^N \approx e^x$, and so $\left(1 - \frac{r}m\right)^N = \left(1 + \frac{-Nr/m}{N}\right)^N \approx e^{-Nr/m}$. Concretely, $\left(1 - \frac{r}m\right)^N$ and $e^{-Nr/m}$ are, respectively, $$1 - N\frac{r}{m} + \binom{N}{2}\left(\frac{r}{m}\right)^2 - \binom{N}{3}\left(\frac{r}{m}\right)^3 + \dots$$ and $$1 - N\frac{r}{m} + \frac{N^2}{2!}\left(\frac{r}{m}\right)^2 - \frac{N^3}{3!}\left(\frac{r}{m}\right)^3 + \dots.$$
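For instance (illustrative numbers only), the two factors already agree to several decimal places for moderate $N$:

```python
from math import exp

m, k = 1000, 7
N = 100 * k  # N = nk with n = 100
for r in range(1, k + 1):
    print(r, (1 - r / m)**N, exp(-N * r / m))
```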

Solution 2:

(Note: This is the correct analysis for the case where the $k$ queries are not necessarily distinct, but I've posted it as a separate answer because my previous one was already too long and messy.)

After the $N = nk$ balls are put in the $m$ bins, suppose a fraction $q$ of the bins are empty. When we make the $k$ queries, the probability that a particular query encounters a non-empty bin is exactly $(1 - q)$, as the query lands in each of the $m$ bins with equal probability. Further, even though the bins are not independent, the $k$ queries we make are completely independent, so the probability that they all encounter non-empty bins is exactly $(1-q)^k$. The final answer (the probability) is obtained by averaging over all possible values of $q$, weighted by their probability: the probability of all $k$ queries seeing a nonempty bin is $$\sum_{\alpha} \Pr(q = \alpha)(1-\alpha)^k$$ where the sum is over the values $\alpha \in [0, 1]$ that $q$ can take, namely values of the form $\frac{r}{m}$ for $0 \le r \le m$. No approximation here.
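Here is a small Monte Carlo sketch of this setup (assuming uniformly random ball placement and independent uniform queries; the parameters are arbitrary illustrations), compared against $(1 - e^{-N/m})^k$:

```python
import random
from math import exp

def trial(m, n, k, rng):
    """One experiment: throw N = n*k balls into m bins, then make k
    independent uniform queries; return True if every query hits a
    non-empty bin."""
    N = n * k
    occupied = [False] * m
    for _ in range(N):
        occupied[rng.randrange(m)] = True
    return all(occupied[rng.randrange(m)] for _ in range(k))

rng = random.Random(0)
m, n, k = 1000, 100, 7
trials = 20000
hits = sum(trial(m, n, k, rng) for _ in range(trials))
print("Monte Carlo estimate:", hits / trials)
print("(1 - e^{-N/m})^k:    ", (1 - exp(-n * k / m)) ** k)
```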

The question of independence of the bins comes up when we try to say what the fraction $q$ might be. The probability of a particular bin (say bin $i$) being empty is $p = \left(1 - \frac1m\right)^N$, since for this bin to be empty, each of the $N$ balls must have gone to one of the $(m - 1)$ bins other than this one. This probability $p$ is also the expected value of the fraction $q$.
(To prove this rigorously: define the indicator variable $X_i$ to be $1$ if bin $i$ is empty, and $0$ otherwise. The total number of empty bins $X = X_1 + X_2 + \dots + X_m$ has expected value $E[X] = E[X_1 + \dots + X_m] = E[X_1] + \dots + E[X_m] = mE[X_1]$ by linearity of expectation, so the expected value of the fraction $q$ (which is the same as $X/m$) is $E[q] = E[X/m] = E[X]/m = E[X_1] = p.$)

The actual value of $q$ can vary, and be different from its expected value $p$. However, the probability of its being far from $p$ is very low: as we move away from $p$, the probability of $q$ taking that value falls off exponentially. As $p \approx e^{-N/m}$, this also means that the probability of $q$ being far from $e^{-N/m}$ is also very low. (We still need to prove this.)

So the probability of all $k$ queries encountering nonempty bins, which is $\sum_{\alpha} \Pr(q = \alpha)(1-\alpha)^k$, is effectively the same as $\left(1 - e^{-N/m} \right)^k$, as $q$ takes on a value around $e^{-N/m}$ with overwhelmingly high probability.
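A quick simulation sketch (same arbitrary parameters as above) showing how tightly $q$ concentrates around $p \approx e^{-N/m}$:

```python
import random
from math import exp
from statistics import mean, stdev

def empty_fraction(m, N, rng):
    """Fraction of empty bins after throwing N balls into m bins."""
    occupied = [False] * m
    for _ in range(N):
        occupied[rng.randrange(m)] = True
    return occupied.count(False) / m

rng = random.Random(1)
m, N = 1000, 700
qs = [empty_fraction(m, N, rng) for _ in range(2000)]
p = (1 - 1 / m) ** N
print("p =", p, " e^{-N/m} =", exp(-N / m))
print("mean(q) =", mean(qs), " stdev(q) =", stdev(qs))
```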


Left to prove: that $q$ is unlikely to be far from $e^{-N/m}$. It's a bit cumbersome to analyze the fraction $q$, as it depends on all the bins, and they are not independent. One approach is to bound the error caused by the independence assumption. Mitzenmacher and Upfal, in their book Probability and Computing: Randomized Algorithms and Probabilistic Analysis, give an elegant technique for doing precisely this.

Let us focus on a particular bin $i$ of the $m$ bins. Consider the number of balls in bin $i$, after $N$ balls have been dropped independently and uniformly at random into the $m$ bins. For each bin $i$, this number (call it $X_i$) follows the binomial distribution: the probability that the bin has $r$ balls is
$\Pr[X_i = r] = \binom{N}{r}\left(\frac1m\right)^r\left(1-\frac1m\right)^{N-r}.$ This is approximately a Poisson distribution with parameter $\lambda = \frac{N}{m}$, or in other words $\Pr[X_i = r] \approx e^{-\lambda}(\lambda^r / r!)$.
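A quick term-by-term check of this per-bin approximation (parameters illustrative):

```python
from math import comb, exp, factorial

m, N = 1000, 700
lam = N / m
for r in range(6):
    binom = comb(N, r) * (1 / m)**r * (1 - 1 / m)**(N - r)  # exact binomial pmf
    poisson = exp(-lam) * lam**r / factorial(r)             # Poisson(lambda) pmf
    print(r, binom, poisson)
```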

Motivated by this, to treat all $m$ bins together, they introduce what they call the Poisson approximation: consider $m$ bins where the number of balls in bin $i$ (call this number $Y_i$) is independently distributed, following a Poisson distribution with parameter $\lambda = \frac{N}{m}$. Of course, under this distribution the total number of balls across the $m$ bins can vary, though it is $N$ in expectation. However, it is a surprisingly straightforward exercise to prove that

  1. The distribution of $(Y_1, \dots, Y_m)$ when conditioned on $\sum_{i=1}^{m} Y_i = N$ is the same as the distribution of $(X_1, \dots, X_m)$.
  2. Any event that takes place with some probability $p$ in the "Poisson approximation scenario" takes place with probability at most $pe\sqrt{N}$ in the "exact scenario".

This inequality is proved by showing that for any nonnegative function $f(x_1, \dots, x_m)$ (such as the indicator function of an event), we have $$\begin{align} E[f(Y_1, \dots, Y_m)] &\ge E\left[f(Y_1, \dots, Y_m) \,\Big|\, \sum_i Y_i = N\right]\Pr\left(\sum_i Y_i = N\right) \\ &= E[f(X_1, \dots, X_m)] \Pr\left(\sum_i Y_i = N\right) \\ &\ge E[f(X_1, \dots, X_m)] \cdot \frac{1}{e\sqrt{N}} \end{align}$$ since $\Pr(\sum_i Y_i = N) = e^{-N} (N^N / N!)$ and $N! \le e\sqrt{N} (N/e)^N$.
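The last step can be checked numerically (a sketch; the values of $N$ are arbitrary), using log-factorials to avoid overflow:

```python
from math import exp, lgamma, log, sqrt

for N in (10, 100, 1000, 10000):
    # log Pr(sum Y_i = N) = -N + N*log(N) - log(N!), since sum Y_i ~ Poisson(N)
    log_p = -N + N * log(N) - lgamma(N + 1)
    print(N, exp(log_p), 1 / (exp(1) * sqrt(N)))  # probability vs its lower bound 1/(e*sqrt(N))
```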

So, using this together with a Chernoff bound, they prove something like $\Pr(|q - p| \ge \epsilon) \le 2e\sqrt{m}\,e^{-m\epsilon^2 / 3p}$.

In fact, they prove using martingales and the Azuma-Hoeffding inequality (12.5.3, p. 308) that $\Pr(|q - p| \ge \frac{\lambda}{m}) \le 2\exp(-2\lambda^2/m)$.

They even have an exercise (12.19, p. 313) showing that $\Pr(|q - p| \ge \frac{\lambda}{m}) \le 2\exp(-\lambda^2(2m - 1) / (m^2 - p^2m^2))$.
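Plugging sample numbers into the two bounds (illustrative values only; the bound expressions are as quoted above) shows the exercise's bound is somewhat tighter:

```python
from math import exp

m, N = 1000, 700
p = (1 - 1 / m) ** N

for lam in (20, 40, 60):
    azuma = 2 * exp(-2 * lam**2 / m)                                     # bound from 12.5.3
    exercise = 2 * exp(-lam**2 * (2 * m - 1) / (m**2 - p**2 * m**2))     # bound from exercise 12.19
    print(lam, azuma, exercise)
```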