probability distribution of coverage of a set after $X$ independently, randomly selected members of the set

I have a set of numbers where I am randomly and independently selecting elements within a set . After a number of these random element selections I want to know the coverage of the elements in the set. Coverage being how many elements from the set have been selected at least once divided by the total number of elements in the set.

To restate this: what is the probability distribution of the different coverage values on a set after $X$ randomly, independently selected elements of the set?


If there are $n$ elements of the set then the probability that $M=m$ have been selected after a sample of $x$ (with replacement) is

$$\frac{S_2(x,m) \; n!}{n^x \; (n-m)!} $$

where $S_2(x,m)$ is a Stirling number of the second kind.

The expected value of $M$ is: $n \left(1- \left(1-\dfrac{1}{n}\right)^x \right)$.

The variance is: $n\left(1-\dfrac{1}{n}\right)^x + n^2 \left(1-\dfrac{1}{n}\right)\left(1-\dfrac{2}{n}\right)^x - n^2\left(1-\dfrac{1}{n}\right)^{2x}. $


The expected proportion of elements covered, $E\left(\frac{m}{n}\right)$, has a simple limiting form as $n \rightarrow \infty$ with the sampling rate $ r / n $ fixed. Note that $\lim_{n \rightarrow \infty} \left(1-\frac{1}{n}\right)^n = e^{-1}$, and rewrite:

$$\lim_{n \rightarrow \infty} E\left(\frac{m}{n}\right) = 1 - e^{-\frac{r}{n}}$$

so that for example sampling $r=n$ times is expected to cover about 63% of the set. This is a reasonable approximation even for $n > 100$.


Derivation of $\operatorname E [M]$ with a classic use of indicator variables and total expectation:

We sample from set $X = \{1, \dots, n \}$. Let $X_i$ be $1$ if member $i$ is in our sample, $0$ otherwise.

For a sample of size $x$, the probability that none of the values are $i$ is $\left(\frac{n-1}{n}\right)^x$. Thus the probability that the sample includes $i$, and therefore $X_i = 1$, is $1-\left(\frac{n-1}{n}\right)^x$.

We have $$\operatorname E[M] = \operatorname E[X_1 + \dots + X_n] = \operatorname E[X_1] + \dots + \operatorname E[X_n] = n \left(1-\left(\frac{n-1}{n}\right)^x\right)$$

Variance is calculated similarly, but we have to consider separately $\operatorname E[X_i^2] $ and $\operatorname E[X_i X_j]$ for $ i \ne j$.