Sampling without replacement expectation proof

Here is my problem and subsequent attempted proof:

Assume we have $n$ different colored balls and $k$ of each color for a total of $nk$ balls. We want to draw from this set without replacement until we have seen at least one of each color.

Let $D_m$ denote the number of distinct colored balls seen after $m$ draws from the defined set. Therefore, $D_m$ has the same probability distribution as the number of draws with replacement until each color is drawn. When sampling with replacement, we have that $m \approx n \log n$ (see coupon collector problem), so this will serve as an upper bound on $\mathbb{E}[D_m]$. Furthermore, if $k = 1$ then we have $n$ different balls and $n$ balls total, so we must draw each one to see each type.

The probability of drawing any ball on drawing $m$ is $$1 - \left(1 - \frac{1}{nk}\right)^m$$ Therefore, the expectation of $D_m$ is $$\mathbb{E}[D_m] = nk \left[ 1 - \left(1 - \frac{1}{nk}\right)^m\right] \rightarrow nk \left(1 - e^{-m/nk}\right) \text{ for $n \gg 1$}$$ Now let $m = c \cdot n \log n$ for some $c \in (0,1)$. This simplifies our expectation to $$\mathbb{E}[D_m] = nk \left(1-e^{-c \log n/k}\right) < nk\left(1-e^{-\log n/k}\right)$$ We now consider two cases, dependent upon the relative value of $k$ to $n$:

$\underline{\textit{Case I:}}$ $\lim_{n \rightarrow \infty}\frac{k}{\log n} = 0$ $$\lim_{n \rightarrow \infty}\mathbb{E}[D_m] = nk(1-0) = nk$$

$\underline{\textit{Case II:}}$ $\lim_{n \rightarrow \infty}\frac{k}{\log n} = \infty$

In this case, since $n \log n < nk$ we have by the sample-with-replacement bound that $$\mathbb{E}[D_m] = n \log n < nk$$ Therefore, if $k < \log n$ for all $n \in \mathbb{Z}$, in expectation we will draw every ball to see each type, whereas when $k > \log n$ we need not.

My question concerns the validity of the approach, and in particular whether two of my "hand-waving" steps are valid: (1) imposing the $n \log n$ upper bound on the expectation, and (2) the definition of $D_m$ as it relates to the sampling-with-replacement problem?


Solution 1:

Mistakes in your analysis

To make things clear, let $N$ be the number of draws it takes to see all colors, so $E[N]$ is your quantity of interest.

Therefore, $D_m$ has the same probability distribution as the number of draws with replacement until each color is drawn.

There are two things wrong with this:

  • First, a unit mismatch. For $D_m$, the number of draws is fixed and the number of distinct colors seen is random. This is not the same as fixing the number of colors and letting the number of draws needed to reach that target be random. Therefore, there is no direct relation between $D_m$ and $N$.

    • What you can probably say is that $E[N]\approx n\log n$.
  • Your definition of $D_m$ still fits in a "without-replacement" paradigm. You could say that it is approximately with-replacement.

The probability of drawing any ball on drawing $m$ is $1−(1−{1\over nk})^m$. Therefore, the expectation of $D_m$ is $\mathbb{E}[D_m] = nk \left[ 1 - \left(1 - \frac{1}{nk}\right)^m\right] \rightarrow nk \left(1 - e^{-m/nk}\right) \text{ for $n \gg 1$}$

  • This calculation assumes drawing with replacement, which is only approximately correct. Approximations should always be notated with $\approx$.

  • The calculation still computes the approximation incorrectly. The probability that a color is present after $m$ draws with replacement would be $$ 1-(1-\tfrac1{\color{red}{n}})^m \approx 1-e^{-m/n} $$ Then, linearity of expectation would imply $$ E[D_m]\approx n(1-e^{-m/n}) $$ The probability without replacement is $$ 1-\binom{(n-1)k}{m}\Big/\binom{nk}{m}, $$ so the exact expression is $$ E[D_m]=n\left(1-\binom{(n-1)k}{m}\Big/\binom{nk}{m}\right) $$
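As a sanity check, here is a short Python sketch (function names are my own) comparing the exact without-replacement expression $n\left(1-\binom{(n-1)k}{m}/\binom{nk}{m}\right)$ with the with-replacement approximation $n(1-e^{-m/n})$:

```python
from math import comb, exp

def e_dm_exact(n, k, m):
    # Exact: each color is missing after m draws w.p. C((n-1)k, m) / C(nk, m)
    # (math.comb returns 0 when the lower index exceeds the upper, so m = nk gives n)
    return n * (1 - comb((n - 1) * k, m) / comb(n * k, m))

def e_dm_approx(n, k, m):
    # With-replacement approximation: n * (1 - e^(-m/n))
    return n * (1 - exp(-m / n))

n, k = 10, 5
for m in (1, 10, 25, 50):
    print(m, round(e_dm_exact(n, k, m), 3), round(e_dm_approx(n, k, m), 3))
```

The two agree well for small $m$ and diverge as $m$ approaches $nk$, where the exact value is pinned at $n$.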

Therefore, if $k < \log n$ for all $n \in \mathbb{Z}$, in expectation we will draw every ball to see each type, whereas when $k > \log n$ we need not.

The problem here is that your language is imprecise. What does "in expectation we will draw every ball to see each type" mean? In order to find $E[N]$, you will need a different method.


For each $i\in \{1,\dots,n\}$, let $A_{i,m}$ be the event that you do not see color number $i$ in the first $m$ draws. Note $$ P(A_{1,m}\cap A_{2,m}\cap\cdots \cap A_{j,m})=\binom{(n-j)k}{m}\Big/\binom {nk}m $$

Now, letting $B_m$ be the event that some color is missing after $m$ draws, we can use the above result, along with the principle of inclusion exclusion, to conclude $$ P(B_m)=\sum_{j=1}^n(-1)^{j-1}\binom{n}{j}\binom{(n-j)k}{m}\Big/\binom {nk}m $$ Warning: For this formula to work, we need to use the following convention: $\binom{a}{b}$ is defined to be zero when $b>a$ or when $b$ is negative. This means we do not have to worry about summation bounds.

Finally, $$ E[N]=\sum_{m=0}^{\infty}P(N>m)=\sum_{m=0}^{nk} P(B_m)= \sum_{m=0}^{nk}\sum_{j=1}^n(-1)^{j-1}\binom{n}{j}\binom{(n-j)k}{m}\Big/\binom {nk}m $$
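The double sum is easy to evaluate and check against a small simulation; the following Python sketch (function names mine) does both. Conveniently, Python's `math.comb` already returns $0$ when the lower index exceeds the upper, matching the convention above.

```python
from math import comb
import random

def expected_N(n, k):
    # E[N] = sum_m P(B_m), with P(B_m) given by inclusion-exclusion.
    # math.comb(a, b) is 0 when b > a, matching the convention in the text.
    nk = n * k
    return sum(
        sum((-1) ** (j - 1) * comb(n, j) * comb((n - j) * k, m)
            for j in range(1, n + 1)) / comb(nk, m)
        for m in range(nk)  # P(B_m) = 0 once m >= nk
    )

def simulate_N(n, k, trials=20_000, seed=0):
    # Monte Carlo estimate: shuffle the nk balls, count draws until all colors appear.
    rng = random.Random(seed)
    balls = [c for c in range(n) for _ in range(k)]
    total = 0
    for _ in range(trials):
        rng.shuffle(balls)
        seen = set()
        for draws, b in enumerate(balls, start=1):
            seen.add(b)
            if len(seen) == n:
                break
        total += draws
    return total / trials

print(expected_N(2, 2))   # exact: 7/3 (two colors, two balls each)
print(simulate_N(2, 2))   # Monte Carlo estimate, close to 7/3
```

For $n=2$, $k=2$ a direct hand count gives $E[N]=2\cdot\frac23+3\cdot\frac13=\frac73$, which both routines reproduce.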

Solution 2:

Let $H_n=\frac11+\frac12+\dots+\frac1n$ be the $n^{th}$ harmonic number, and let $$ H_{n,p}=\sum_{i=1}^n\frac1{i^p} $$ be the $(n,p)^{\text{th}}$ generalized harmonic number. Recall $H_n\approx \log n+\gamma$, and $H_{n,2}\approx \zeta(2)-\int_{n+1}^\infty x^{-2}\,dx=\zeta(2)-\frac{1}{(n+1)}$. That is, for large $n$, $H_{n,2}$ is approximately constant.
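These asymptotics are straightforward to confirm numerically; a quick Python check (the Euler–Mascheroni constant is hard-coded):

```python
from math import log, pi

GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def H(n, p=1):
    # Generalized harmonic number H_{n,p}; p = 1 gives H_n.
    return sum(1 / i ** p for i in range(1, n + 1))

n = 10_000
print(H(n) - (log(n) + GAMMA))   # ~ 1/(2n): H_n = log n + gamma + O(1/n)
print(pi ** 2 / 6 - H(n, 2))     # ~ 1/n: H_{n,2} -> zeta(2) as n grows
```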

Answer: The expected number of balls drawn has the following first-order asymptotic expansion: $$ nH_n -\frac1k\left[n\left(\frac{(H_n)^2+H_{n,2}}2\right)-H_n\right] +O\left({1\over k^2}\right)\qquad \text{as }k\to\infty $$ The dominant term is $nH_n$, while the second term is on the order of $\frac1k n (\log n)^2$ when $n$ is large. This shows that as long as $k\gg n (\log n)^2$, the expected value is very close to that of the classical coupon collector problem with replacement.

Proof: It turns out there is a very simple expression for the expected number of balls drawn: $$ E[\text{# balls drawn}] =\sum_{i=1}^n (-1)^{i-1}\binom{n}{i} \frac{nk+1}{ik+1} \tag 1 $$
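Formula $(1)$ can be checked in exact rational arithmetic against small cases (for instance, $n=2$, $k=2$ gives $E[N]=7/3$ by a direct hand count, and $k=1$ must give exactly $n$); a Python sketch with my own function name:

```python
from fractions import Fraction
from math import comb

def expected_N_closed(n, k):
    # Formula (1): sum_{i=1}^n (-1)^(i-1) C(n,i) (nk+1)/(ik+1), exact arithmetic
    return sum(Fraction((-1) ** (i - 1) * comb(n, i) * (n * k + 1), i * k + 1)
               for i in range(1, n + 1))

print(expected_N_closed(2, 2))  # 7/3: two colors, two balls each
print(expected_N_closed(3, 1))  # 3: with k = 1 every ball must be drawn
```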

I gave a proof here: Coupon collector without replacement. In order to estimate how large we need to make $k$ in order for this to be close to the limiting value $nH_n$, let us expand the fraction $\frac{nk+1}{ik+1}$ into negative powers of $k$: $$ \begin{align} \frac{nk+1}{ik+1} &=(nk+1)\frac{(ik)^{-1}}{1+(ik)^{-1}} \\&=(nk+1)\left[\frac1{ik}-\frac1{i^2k^2}+\frac1{i^3k^3}-\dots\right] \\&= \frac{n}{i}+\frac1k\left(\frac1i-\frac{n}{i^2}\right)+O\left(\frac1{k^2}\right) \quad \text{as $k\to\infty$}\tag2 \end{align} $$ Combining $(1)$ and $(2)$, the expected number of draws is $$ n\left(\color{#a43}{\sum_{i=1}^n(-1)^{i-1}\binom{n}i\frac1i}\right) + \frac1k\left[\left(\color{#a43}{\sum_{i=1}^n(-1)^{i-1}\binom{n}i\frac1i}\right) - n\left(\color{#207}{\sum_{i=1}^n(-1)^{i-1}\binom{n}i\frac1{i^2}}\right)\right] + O\left(\frac1{k^2}\right) $$ Therefore, in order to complete the proof, it suffices to prove the combinatorial identities $$ \begin{align} \color{#a43}{\sum_{i=1}^n(-1)^{i-1}\binom{n}i\frac1i} &= H_n \tag3 \\ \color{#207}{\sum_{i=1}^n(-1)^{i-1}\binom{n}i\frac1{i^2}} &= \frac{(H_n)^2+H_{n,2}}2\tag4 \end{align} $$ Wikipedia has a proof of $(3)$, and you can prove $(4)$ by similar methods.
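Both identities can also be spot-checked in exact rational arithmetic; a short Python verification (helper names mine):

```python
from fractions import Fraction
from math import comb

def H(n, p=1):
    # Generalized harmonic number H_{n,p} as an exact rational
    return sum(Fraction(1, i ** p) for i in range(1, n + 1))

def alt_sum(n, p):
    # sum_{i=1}^n (-1)^(i-1) C(n,i) / i^p
    return sum(Fraction((-1) ** (i - 1) * comb(n, i), i ** p)
               for i in range(1, n + 1))

for n in range(1, 13):
    assert alt_sum(n, 1) == H(n)                        # identity (3)
    assert alt_sum(n, 2) == (H(n) ** 2 + H(n, 2)) / 2   # identity (4)
print("identities (3) and (4) hold for n = 1..12")
```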