Making 400k random choices from 400k samples seems to always end up with 63% distinct choices, why?
I have a very simple simulation program, the sequence is:
- Create an array of 400k elements
- Use a PRNG to pick an index, and mark the element (repeat 400k times)
- Count number of marked elements.
An element may be picked more than once, but counted as only one "marked element".
The PRNG is properly seeded. No matter how many times I run the simulation, I always end up getting around 63% (252k) marked elements.
What is the math behind this? Or was there a fault in my PRNG?
Solution 1:
No, your program is correct. The probability that a particular element is not marked at all, is about $\frac{1}{e}$. This comes from the poisson-distribution which is a very well approximation for large samples (400k is very large). So $1-\frac{1}{e}$ is the fraction of marked elements.
Solution 2:
Let $X_k\in \{0, 1\}$ indicate if entry $k$ is unmarked (in which case $X_k=1$). Then the expected number of unmarked items $X$ in an array of $N$ is $$\mathbb{E}(X) = \mathbb{E}\left(\sum_{k=1}^N X_k\right) = \sum_{k=1}^N\mathbb{E}(X_k) = N \, \left(1-\frac{1}{N}\right)^N \approx N \, e^{-1}.$$
The expected number of marked items is therefore approximately $N \, (1-e^{-1})$ or $N \cdot 0.63212\cdots$ which matches your observations quite well. To make the approximation more precise one can show that $$ N\,e^{-1} -\frac{1}{2e(1-1/N)}< N\left(1-\frac{1}{N}\right)^N < N\,e^{-1} -\frac{1}{2e}$$ for all $N\geq 2$.
Solution 3:
This problem was recently solved in a slightly more general form using the concept of throwing $m$ balls into $n$ boxes:
We throwing $m$ balls to $n$ cells....
Consider $n$ boxes (the fixed list in our case). We now select $m$ items from the list at random (with replacement) -- or by throwing $m$ balls into $n$ boxes. That problem found the expected fraction of non-empty or marked boxes as $1-(1-1/n)^m$ using the same approach as @WimC. So if we made $m=800k$ choices from the list of $n=400k$ items then the expected fraction of the $n$ items that are marked would be $1-(1-\frac {m/n}{m})^m\approx 1-e^{-m/n}=1-e^{-2}.$