Making 400k random choices from 400k samples seems to always end up with 63% distinct choices, why?

I have a very simple simulation program, the sequence is:

  • Create an array of 400k elements
  • Use a PRNG to pick an index, and mark the element (repeat 400k times)
  • Count number of marked elements.

An element may be picked more than once, but counted as only one "marked element".

The PRNG is properly seeded. No matter how many times I run the simulation, I always end up getting around 63% (252k) marked elements.

What is the math behind this? Or was there a fault in my PRNG?


Solution 1:

No, your program is correct. The probability that a particular element is not marked at all, is about $\frac{1}{e}$. This comes from the poisson-distribution which is a very well approximation for large samples (400k is very large). So $1-\frac{1}{e}$ is the fraction of marked elements.

Solution 2:

Let $X_k\in \{0, 1\}$ indicate if entry $k$ is unmarked (in which case $X_k=1$). Then the expected number of unmarked items $X$ in an array of $N$ is $$\mathbb{E}(X) = \mathbb{E}\left(\sum_{k=1}^N X_k\right) = \sum_{k=1}^N\mathbb{E}(X_k) = N \, \left(1-\frac{1}{N}\right)^N \approx N \, e^{-1}.$$

The expected number of marked items is therefore approximately $N \, (1-e^{-1})$ or $N \cdot 0.63212\cdots$ which matches your observations quite well. To make the approximation more precise one can show that $$ N\,e^{-1} -\frac{1}{2e(1-1/N)}< N\left(1-\frac{1}{N}\right)^N < N\,e^{-1} -\frac{1}{2e}$$ for all $N\geq 2$.

Solution 3:

This problem was recently solved in a slightly more general form using the concept of throwing $m$ balls into $n$ boxes:

We throwing $m$ balls to $n$ cells....

Consider $n$ boxes (the fixed list in our case). We now select $m$ items from the list at random (with replacement) -- or by throwing $m$ balls into $n$ boxes. That problem found the expected fraction of non-empty or marked boxes as $1-(1-1/n)^m$ using the same approach as @WimC. So if we made $m=800k$ choices from the list of $n=400k$ items then the expected fraction of the $n$ items that are marked would be $1-(1-\frac {m/n}{m})^m\approx 1-e^{-m/n}=1-e^{-2}.$