Coupon collector's problem: mean and variance of the number of coupons needed to complete a set (unequal probabilities)

Solution 1:

A3: Using the Poisson process to magically concoct independent random variables. This is the most powerful of all the approaches, since it's the only one that lets us find both the mean and the variance (and higher moments as well) for the general case of coupons having unequal probabilities.

The other approaches either work for all moments but only for the special case of equal probabilities (A1 and A2), or for the general case of unequal probabilities but only for the mean (A4).

I asked and answered a question about this earlier: Coupon collectors problem: variance calculation missing a term. This post is a consolidation.


In Example 5.17 of the book Introduction to Probability Models by Sheldon Ross, the coupon collector's problem is tackled for the general case where the probability of drawing coupon $j$ is $p_j$ and, of course, $\sum\limits_j p_j = 1$.

Now, he imagines that the collector collects the coupons in accordance with a Poisson process with rate $\lambda=1$. Furthermore, every coupon that arrives is of type $j$ with probability $p_j$.

Now, he defines $X_j$ as the first time a coupon of type $j$ is observed; by the thinning property, coupons of type $j$ arrive in accordance with a Poisson process with rate $p_j$. We're interested in the time it takes to collect all coupons, $X$ (for now; eventually, we're interested in the number of coupons to be collected, $N$). So we get:

$$X = \max_{1\leq j \leq m}X_j$$

Note that if we denote by $N_j$ the number of coupons to be collected before the first coupon of type $j$ is seen, we also have, for the number needed to collect all coupons, $N$:

$$N = \max_{1\leq j \leq m}N_j \tag{0}$$

This equation is less useful since the $N_j$ are not independent. It can still be used to get the mean (see answer A4), but getting the variance with this approach becomes considerably more challenging because of the dependence between the underlying random variables.

But the incredible fact that the $X_j$ are independent (discussion on that here) allows us to get:

$$F_X(t) = P(X<t) = P(X_j<t \; \forall \; j) = \prod\limits_{j=1}^{m}(1-e^{-p_j t})\tag{1}$$
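As a quick sanity check of equation (1), here is a minimal Python sketch (the probabilities, sample size and evaluation point are arbitrary choices of mine, not from the original): each $X_j$ is simulated as an exponential with rate $p_j$, since that is the first arrival time of the type-$j$ process.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])   # arbitrary coupon probabilities summing to 1

# X_j ~ Exp(p_j): the first arrival time of the type-j (thinned) Poisson process
X = rng.exponential(scale=1 / p, size=(100_000, len(p))).max(axis=1)

t = 5.0                              # arbitrary evaluation point
print((X < t).mean())                # empirical P(X < t)
print(np.prod(1 - np.exp(-p * t)))   # equation (1); should agree to ~2-3 decimals
```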

Mean

Now, Ross uses the expression: $E(X) = \int\limits_0^\infty S_X(t)dt$, where $S_X(t)$ is the survival function to get:

$$E(X) = \int\limits_{0}^{\infty}\left(1-\prod\limits_{j=1}^{m}(1-e^{-p_j t})\right) dt $$

$$= \sum\limits_j\frac 1 p_j - \sum\limits_{i<j}\frac {1}{p_i+p_j} + \dots +(-1)^{m-1} \frac{1}{p_1+\dots+p_m}\tag{2}$$
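Equation (2) is easy to verify numerically. Here is a sketch (assuming numpy and scipy are available; the probabilities are an arbitrary example) that computes the inclusion-exclusion sum over all non-empty subsets of coupon types and compares it with direct numerical integration of the survival function:

```python
import numpy as np
from itertools import combinations
from scipy.integrate import quad

p = [0.5, 0.3, 0.2]   # arbitrary unequal coupon probabilities
m = len(p)

# Equation (2): inclusion-exclusion over all non-empty subsets of coupon types
EX = sum((-1) ** (len(S) - 1) / sum(S)
         for r in range(1, m + 1) for S in combinations(p, r))

# Direct numerical integration of the survival function
survival = lambda t: 1 - np.prod([1 - np.exp(-pj * t) for pj in p])
EX_numeric, _ = quad(survival, 0, np.inf)
print(EX, EX_numeric)   # the two values should agree
```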

For the special case of equal probabilities we have $p_j=\frac{1}{n} \; \forall \; j$ (so the number of coupon types is $m=n$).

Substituting in the equation above we get:

$$E(X) = n\sum\limits_{k=1}^{n}(-1)^{k-1} \frac{\binom{n}{k}}{k}$$

From here, applying the identity $\sum\limits_{k=1}^n(-1)^{k-1}\frac{\binom{n}{k}}{k}=\sum\limits_{k=1}^n\frac{1}{k}$ (the $r=1$ case of the binomial identity noted further below), we get as before:

$$E(X) = n\sum\limits_{k=1}^n \frac{1}{k}$$
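This collapse of the alternating sum to $n\sum_{k=1}^n \frac 1 k$ can be confirmed in exact arithmetic; a small Python sketch (the value of $n$ is an arbitrary choice):

```python
from fractions import Fraction
from math import comb

n = 10   # arbitrary number of coupon types
alternating = n * sum(Fraction((-1) ** (k - 1) * comb(n, k), k)
                      for k in range(1, n + 1))
harmonic = n * sum(Fraction(1, k) for k in range(1, n + 1))
assert alternating == harmonic   # n * H_n, exactly
print(alternating)
```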

Further, Ross shows that $E(N)=E(X)$ using the law of total expectation.

First, he notes,

$$E(X \mid N=k)=k\,E(T_i)$$

where the $T_i$ are the inter-arrival times of the coupon arrivals. Since these are assumed to be exponential with rate 1,

$$E(X|N)=N\tag{3}$$

Taking expectations on both sides and using the law of total expectation we get:

$$E(X)=E(N)$$

Variance

This approach can easily be extended to find $V(N)$, the variance (not covered by Ross). We can use the following expression to get $E(X^2)$:

$$E(X^2) = \int\limits_0^\infty 2tP(X>t)dt = \int\limits_0^\infty 2t\left(1-\prod\limits_{j=1}^m(1-e^{-p_j t})\right)dt$$

Using the fact that $\int\limits_0^\infty te^{-pt}\,dt=\frac{1}{p^2}$ and the same algebra as for $E(X)$, we get:

$$\frac{E(X^2)}{2} = \sum\limits_j \frac {1} {p_j^2} -\sum\limits_{i<j} \frac{1}{(p_i+p_j)^2}+\dots +(-1)^{m-1}\frac{1}{(p_1+\dots+p_m)^2} $$
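As with the mean, this can be checked numerically; a sketch along the same lines as before (numpy/scipy assumed, probabilities an arbitrary example):

```python
import numpy as np
from itertools import combinations
from scipy.integrate import quad

p = [0.5, 0.3, 0.2]   # arbitrary unequal coupon probabilities
m = len(p)

# Inclusion-exclusion for E(X^2)/2
half_EX2 = sum((-1) ** (len(S) - 1) / sum(S) ** 2
               for r in range(1, m + 1) for S in combinations(p, r))

# Direct numerical integration of 2 t P(X > t)
integrand = lambda t: 2 * t * (1 - np.prod([1 - np.exp(-pj * t) for pj in p]))
EX2_numeric, _ = quad(integrand, 0, np.inf)
print(2 * half_EX2, EX2_numeric)   # should agree
```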

Now, let's consider the special case where all coupons have an equal probability of being selected. In other words, $p_j=\frac 1 n \; \forall \; j$ (again with $m=n$).

We get:

$$\frac{E(X^2)}{2} = n^2\left(\sum\limits_{k=1}^n (-1)^{k-1}\frac{\binom{n}{k}}{k^2}\right)$$

Per my answer to the question here, this summation yields:

$$E(X^2) = 2n^2\left( \sum_{j=1}^n\sum_{k=1}^j\frac{1}{jk}\right)\tag{4}$$

As a side-note, the binomial identity arising from the calculation of the second moment can be generalized: $$\sum_{k=1}^n(-1)^{k-1}\frac{\binom{n}{k}}{k^r}=\sum_{1\leq i_1\leq i_2\leq \dots \leq i_r\leq n}\frac{1}{i_1 i_2 \cdots i_r}$$ See here.
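A quick exact-arithmetic check of this generalization, for one arbitrary choice of $n$ and $r$:

```python
from fractions import Fraction
from itertools import combinations_with_replacement
from math import comb, prod

n, r = 8, 3   # arbitrary small values
lhs = sum(Fraction((-1) ** (k - 1) * comb(n, k), k ** r)
          for k in range(1, n + 1))
rhs = sum(Fraction(1, prod(idx))   # multisets 1 <= i_1 <= ... <= i_r <= n
          for idx in combinations_with_replacement(range(1, n + 1), r))
assert lhs == rhs
print(lhs)
```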

Equation (4) has given us $E(X^2)$ but remember that we're interested in finding $E(N^2)$.

Using the law of total variance we get:

$$V(X)=E(V(X|N))+V(E(X|N))$$

So per equation (3) we have:

$$V(X)=E(V(X|N))+V(N)\tag{5}$$

Now,

$$V(X|N)=NV(T_i)$$

And since $T_i \sim \text{Exp}(1)$, we have $V(T_i)=1$, meaning $V(X|N)=N$.

Substituting into (5),

$$V(X)=E(N)+V(N)$$

Since $V(X)=E(X^2)-E(X)^2$ and $E(X)=E(N)$, this becomes:

$$V(N)=E(X^2)-E(N)-E(N)^2$$

where equation (4) gives $E(X^2)$, while $E(N)=n\sum_{k=1}^n \frac{1}{k}$ as shown multiple times on this page. This is consistent with equation (5) of A2.
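Putting all the pieces together, here is a sketch (the probabilities, sample size, seed and helper names are arbitrary choices of mine) that computes $E(N)$ and $V(N)$ from the formulas above for unequal probabilities and compares them with a direct simulation of the collection process:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
p = np.array([0.5, 0.3, 0.2])   # arbitrary unequal probabilities
m = len(p)

def incl_excl(power):
    """Sum over non-empty subsets S of (-1)^(|S|-1) / (sum of S)^power."""
    return sum((-1) ** (len(S) - 1) / sum(S) ** power
               for r in range(1, m + 1) for S in combinations(p, r))

EN = incl_excl(1)            # E(N) = E(X), equation (2)
EX2 = 2 * incl_excl(2)       # E(X^2), from the second-moment formula
VN = EX2 - EN - EN ** 2      # V(N), as derived above

def collect():
    # number of draws until every coupon type has been seen
    seen, draws = set(), 0
    while len(seen) < m:
        seen.add(int(rng.choice(m, p=p)))
        draws += 1
    return draws

sample = [collect() for _ in range(20_000)]
print(EN, np.mean(sample))   # means should agree
print(VN, np.var(sample))    # variances should agree
```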

Solution 2:

A4: Using the maximum of minimums identity


Let $N_j$ be the number of coupons to be collected before we see the first coupon of type $j$ and $N$ the number of coupons until all are collected. We have:

$$N = \max_{1\leq j \leq n}N_j$$

This is equation (0) of answer A3, and in conjunction with the maximum of minimums identity we get:

$$N = \sum\limits_j N_j - \sum\limits_{1\leq j < k\leq n} \min(N_j, N_k) + \sum\limits_{1\leq j < k < i \leq n} \min(N_j, N_k, N_i) - \dots + (-1)^{n-1}\min(N_1, \dots, N_n) \tag{1}$$

and the fact that the minimum over any subset $S$ of the $N_j$ is a geometric random variable with parameter $p=\sum\limits_{j\in S} p_j$ leads, on taking expectations term by term, to equation (2) of A3. From there, we can substitute $p_j=\frac 1 n \; \forall \; j$ to get:

$$E(N) = n\sum\limits_{k=1}^n \frac 1 k$$
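The geometric-minimum fact underpinning this is easy to confirm by simulation; a sketch (the subset, probabilities, sample size and helper name are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
p = np.array([0.5, 0.3, 0.2])   # arbitrary probabilities
S = [0, 2]                      # arbitrary subset of coupon types

def first_hit():
    # draws until a coupon whose type lies in S first appears
    draws = 0
    while True:
        draws += 1
        if int(rng.choice(len(p), p=p)) in S:
            return draws

sample = [first_hit() for _ in range(50_000)]
print(np.mean(sample), 1 / p[S].sum())   # both should be near 1/(p_0 + p_2)
```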

Note that it's not easy to get the variance, $V(N)$ with this approach because the terms in equation (1) are not independent.

Solution 3:

A2: Using a recurrence


Consider a state where the collector has $m$ coupons in his collection. Let $T_m$ be the number of coupons needed to complete the collection. If the total coupons he needs to collect to complete the collection is $N$, we then have:

$$N = T_0$$

Now, we could observe that (as pointed out by @DaivdK in the comments):

$$N_m = T_{m}-T_{m+1}$$

and summing over all $m$ (and noting that $T_n=0$) leads us to:

$$T_0 = \sum_m N_m$$

and this leads to the approach in A1 which makes the problem much easier to solve.

Alternately, we can continue working with the $T_m$'s and construct a recurrence.

Consider what happens when the collector has $m$ coupons and he collects one more. With probability $\frac{m}{n}$, he fails to add a new coupon and is back to where he started, making no progress. Let $I\left(\frac{m}{n}\right)$ be a Bernoulli random variable with $p=\frac{m}{n}$. We then have the expression:

$$T_m = 1+I\left(\frac{m}{n}\right)T_m'+\left(1-I\left(\frac{m}{n}\right)\right)T_{m+1}\tag{1}$$

where $T_m'$ is an independent copy of $T_m$ (identically distributed with it, but independent of it). Taking expectations on both sides,

$$E(T_m) = 1+ \frac{m}{n}E(T_m)+\frac{n-m}{n}E(T_{m+1})$$

$$E(T_m)\left(1-\frac{m}{n}\right) = 1+ \left(1-\frac{m}{n}\right)E(T_{m+1})$$

$$E(T_m)-E(T_{m+1}) = \frac{n}{n-m}\tag{2}$$ As noted before, the L.H.S is simply $E(N_m)$ as defined in A1. In general we have, $$\sum\limits_{m=k}^{n-1}E(T_m)-\sum\limits_{m=k}^{n-1}E(T_{m+1}) = \sum\limits_{m=k}^{n-1}\frac{n}{n-m}$$

Noting that $T_n=0$ we have $$E(T_k)=\sum\limits_{m=k}^{n-1}\frac{n}{n-m}$$ and, letting $m=n-k$,

$$E(T_{n-m}) = n\sum\limits_{k=1}^{m}\frac{1}{k}\tag{3}$$

We're interested in $T_0$, so let's substitute $m=n$ in equation (3).

$$E(T_0) = n \sum\limits_{k=1}^{n}\frac{1}{k}$$
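The recurrence itself gives a quick exact check of this formula; a Python sketch (the value of $n$ is arbitrary):

```python
from fractions import Fraction

n = 12                                   # arbitrary number of coupon types
E_T = [Fraction(0)] * (n + 1)            # E_T[m] = E(T_m), with T_n = 0
for m in range(n - 1, -1, -1):
    E_T[m] = E_T[m + 1] + Fraction(n, n - m)   # recurrence (2)

harmonic = n * sum(Fraction(1, k) for k in range(1, n + 1))
assert E_T[0] == harmonic                # matches n * H_n exactly
print(E_T[0])
```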


Now, let's try to find the variance, $V(N)=V(T_0)$. Let's square both sides of equation (1). To make the algebra easier, let's re-arrange and note that $I(\frac{m}{n})(1-I(\frac{m}{n}))=I(\frac{m}{n})-I(\frac{m}{n})^2=0$.

$$=>(T_m-1)^2 = I\left(\frac{m}{n}\right)^2 T_m'^2+(1+I\left(\frac{m}{n}\right)^2-2I\left(\frac{m}{n}\right))T_{m+1}^2$$

Now, note the following property of Bernoulli random variables: $I(\frac{m}{n})^2=I(\frac{m}{n})$. This means:

$$T_m^2-2T_m+1 = I\left(\frac{m}{n}\right) T_m'^2+(1-I\left(\frac{m}{n}\right))T_{m+1}^2$$

We have to be careful here to note which random variables are i.i.d. and which are identical. See here: How to square equations involving random variables.

Taking expectation and doing some algebra gives us,

$$\left(1-\frac{m}{n}\right)E(T_m^2)=2E(T_m)+\left(1-\frac{m}{n}\right)E(T_{m+1}^2)-1$$

$$=>E(T_m^2)-E(T_{m+1}^2)=2E(T_m)\frac{n}{n-m}-\frac{n}{n-m}$$

$$=>\sum\limits_{m=0}^{n-1}E(T_m^2)-\sum\limits_{m=0}^{n-1}E(T_{m+1}^2)=\sum\limits_{m=0}^{n-1}2E(T_m)\frac{n}{n-m}-\sum\limits_{m=0}^{n-1}\frac{n}{n-m}$$

$$=> E(T_0^2)-E(T_n^2)=\sum\limits_{m=0}^{n-1}2E(T_m)\frac{n}{n-m}-\sum\limits_{m=0}^{n-1}\frac{n}{n-m}$$

But, $T_n=0$ and from equation (3), $E(T_m)=n \sum\limits_{k=1}^{n-m}\frac 1 k$. So we get:

$$E(T_0^2) = \sum\limits_{m=0}^{n-1}2E(T_m)\frac{n}{n-m}-\sum\limits_{m=0}^{n-1}\frac{n}{n-m}$$

$$=>E(T_0^2) = 2n^2 \sum\limits_{m=0}^{n-1}\frac{1}{n-m}\sum\limits_{k=1}^{n-m}\frac{1}{k} -n\sum\limits_{m=0}^{n-1}\frac{1}{n-m}$$ Now, change variables $j=n-m$

$$=>E(T_0^2) = 2n^2 \sum\limits_{j=1}^{n}\frac{1}{j}\sum\limits_{k=1}^{j}\frac{1}{k} -n\sum\limits_{j=1}^{n}\frac{1}{j}$$

$$=>E(T_0^2) = 2n^2\sum\limits_{1 \leq k \leq j \leq n} \frac{1}{jk}-E(T_0)\tag{4}$$

This can be used easily to get the variance.

$$V(T_0) = 2n^2\sum\limits_{1 \leq k \leq j \leq n} \frac{1}{jk}-E(T_0)-E(T_0)^2\tag{5}$$
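Both (4) and (5) can be verified in exact arithmetic by iterating the second-moment recurrence downward from $T_n=0$; a sketch (with $n$ arbitrary):

```python
from fractions import Fraction

n = 12                                    # arbitrary
E_T  = [Fraction(0)] * (n + 1)            # E(T_m)
E_T2 = [Fraction(0)] * (n + 1)            # E(T_m^2)
for m in range(n - 1, -1, -1):
    E_T[m]  = E_T[m + 1] + Fraction(n, n - m)
    # second-moment recurrence: E(T_m^2) - E(T_{m+1}^2) = (2 E(T_m) - 1) n/(n-m)
    E_T2[m] = E_T2[m + 1] + (2 * E_T[m] - 1) * Fraction(n, n - m)

# Equation (4)
double_sum = sum(Fraction(1, j * k)
                 for j in range(1, n + 1) for k in range(1, j + 1))
assert E_T2[0] == 2 * n**2 * double_sum - E_T[0]

print(E_T2[0] - E_T[0] ** 2)   # V(T_0), equation (5)
```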

Comparing equation (5) above with equation (2) of A1 we get the easily verifiable identity:

$$2\sum_{1\leq j\leq k \leq n} \frac{1}{jk}=\sum\limits_{i=1}^n\frac{1}{i^2}+\left(\sum\limits_{i=1}^n\frac{1}{i}\right)^2$$
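Indeed, a one-off exact check of this identity (for an arbitrary $n$):

```python
from fractions import Fraction

n = 25   # arbitrary
lhs = 2 * sum(Fraction(1, j * k)
              for j in range(1, n + 1) for k in range(j, n + 1))
rhs = (sum(Fraction(1, i * i) for i in range(1, n + 1))
       + sum(Fraction(1, i) for i in range(1, n + 1)) ** 2)
assert lhs == rhs
```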

Solution 4:

A1: Using a sum of geometric random variables


Consider the state where the collector has already collected $m$ distinct coupons. How many more coupons does he need to collect to get to $m+1$? Let this be represented by the random variable $N_m$. Then, if the total number of coupons needed is $N$, we have:

$$N = \sum\limits_{m=0}^{n-1} N_m\tag{1}$$

Every coupon collected from here on is like a coin toss: with probability $\frac m n$, the collector draws a coupon he already has and makes no progress, while with probability $\frac{n-m}{n}$ he collects a new coupon. So, $N_m$ is a geometric random variable with $p=\frac{n-m}{n}$. We know that a geometric random variable has mean $\frac{1}{p}$ and variance $\frac{1-p}{p^2}$. Hence,

$$E(N_m)=\frac{n}{n-m}$$

Taking expectation of equation (1) and substituting we have:

$$E(N) = \sum\limits_{m=0}^{n-1} E(N_m) = \sum\limits_{m=0}^{n-1} \frac{n}{n-m}=n \sum\limits_{m=0}^{n-1} \frac{1}{n-m}$$

Substituting $k=n-m$ we get:

$$E(N) = n \sum\limits_{k=1}^n \frac{1}{k}$$

Similarly, since the $N_m$ are independent, the variance $V(N)$ is just the sum of the individual variances $V(N_m)=\frac{mn}{(n-m)^2}$, which gives:

$$V(N) = n^2\sum\limits_{i=1}^n \frac{1}{i^2}-n\sum\limits_{k=1}^n \frac{1}{k}\tag{2}$$
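As a final check, a sketch (with $n$, sample size, seed and the helper name `collect` being arbitrary choices) comparing these closed forms against a simulation of the equal-probability collector:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6   # arbitrary number of coupon types (equal probabilities)

EN = n * sum(1 / k for k in range(1, n + 1))
VN = n**2 * sum(1 / k**2 for k in range(1, n + 1)) - EN   # equation (2)

def collect():
    # draws until all n equally likely coupon types have been seen
    seen, draws = set(), 0
    while len(seen) < n:
        seen.add(int(rng.integers(n)))
        draws += 1
    return draws

sample = [collect() for _ in range(50_000)]
print(EN, np.mean(sample))   # should agree
print(VN, np.var(sample))    # should agree
```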