Last coupon collected in the coupon collectors problem

Consider the classical coupon collectors problem. Given a particular coupon $i$ we can ask for the probability that $i$ is the last coupon collected. We asked this question on cstheory and got a wonderful answer from James Martin introducing us to the idea of Poissonization in coupon collecting:

$Pr(L=i) = \int_{0}^{\infty} n_i e^{n_i t} \prod_{j\neq i} (1-e^{n_j t}) dt$

where $L$ is the last coupon collected and $n_i$ is the number of coupons of type $i$, with $\sum_i n_i = k$.

For example:

$\int_0^\infty a e^{-a t} \left(1-e^{-b t}\right) \left(1-e^{-c t}\right) \left(1-e^{-d t}\right) \left(1-e^{-e t}\right) \left(1-e^{-f t}\right) \, dt$,

gives the following in Mathematica:

$a \left[ \frac{1}{a}-\frac{1}{a+b}-\frac{1}{a+c}+\frac{1}{a+b+c}-\frac{1}{a+d}+\frac{1}{a+b+d}+\frac{1}{a+c+d}-\frac{1}{a+b+c+d}- \right.$ $\left. \frac{1}{a+e}+\frac{1}{a+b+e}+\frac{1}{a+c+e}-\frac{1}{a+b+c+e}+\frac{1}{a+d+e}-\frac{1}{a+b+d+e}-\frac{1}{a+c+d+e}+ \right.$ $\left. \frac{1}{a+b+c+d+e}-\frac{1}{a+f}+\frac{1}{a+b+f}+\frac{1}{a+c+f}-\frac{1}{a+b+c+f}+\frac{1}{a+d+f}-\frac{1}{a+b+d+f}- \right.$ $\left. \frac{1}{a+c+d+f}+\frac{1}{a+b+c+d+f}+\frac{1}{a+e+f}-\frac{1}{a+b+e+f}-\frac{1}{a+c+e+f}+\frac{1}{a+b+c+e+f}- \right.$ $\left. \frac{1}{a+d+e+f}+\frac{1}{a+b+d+e+f}+\frac{1}{a+c+d+e+f}-\frac{1}{a+b+c+d+e+f} \right]$.

The trouble is that this expression takes exponential time in the number of coupons to evaluate and has an exponential number of terms. We are looking for a way to bound this asymptotically in $k$, but any further references or information would be greatly appreciated.


You can view this as an inclusion-exclusion problem; perhaps the reformulation as such will be helpful. Also, there are at least two ways to set this up using inclusion-exclusion. I think the one that is equivalent to your example above results in the simpler expression, so I'll do that one. Then I'll talk a little about bounds at the end.

Reformulation using inclusion/exclusion:

Let $a, 1, 2, \ldots, m$ denote the different types of coupons, with $n_i$ coupons of type $i$. Suppose type $a$ is the type we want to collect last. Let $A_i$, $1 \leq i \leq m$, denote the event that coupon type $a$ is collected before type $i$. Then you're asking for

$$P(A'_1 \cap A'_2 \cap \cdots \cap A'_m) = 1 - P(A_1 \cup A_2 \cup \cdots \cup A_m).$$

By the inclusion-exclusion formula, this is

$$1 - \sum_{i=1}^m \mathbb{P}(A_i) + \sum_{i,j\,:\,i<j}\mathbb{P}(A_i\cap A_j) - \sum_{i,j,k\,:\,i<j<k}\mathbb{P}(A_i\cap A_j\cap A_k)+\ \cdots\ +(-1)^m\, \mathbb{P}\biggl(\bigcap_{i=1}^m A_i\biggr).$$

For any events $A_{i_1}, A_{i_2}, \ldots, A_{i_j}$, $P(A_{i_1} \cap A_{i_2} \cap \cdots \cap A_{i_j})$ is the probability that type $a$ is chosen before types $i_1, i_2, \ldots, i_j$. This is just $\frac{n_a}{n_a + n_{i_1} + n_{i_2} + \cdots n_{i_j}}$, as the other coupons are irrelevant for this calculation. So the probability you're after is

$ \begin{align} n_a \bigg(&\frac{1}{n_a} - \sum_{i=1}^m \frac{1}{n_a + n_i} + \sum_{i,j\,:\,i<j} \frac{1}{n_a + n_i + n_j} - \sum_{i,j,k\,:\,i<j<k} \frac{1}{n_a + n_i + n_j +n_k} + \cdots \\ & +(-1)^m\, \frac{1}{n_a + \sum_{i=1}^m n_i}\bigg). \end{align}$

(Note that this is a generalization of the expression in your example.)

Bounds:

(Added.) The term $\sum_{i=1}^m \frac{1}{n_a+n_i}$ is calculating $\frac{m}{H}$, where $H$ is the harmonic mean of $n_a+n_1, n_a+n_2, \ldots, n_a+n_m$. Since the arithmetic mean $A$ satisfies $A \geq H$ you could substitute for the entire first sum the simpler expression $$\frac{m}{A} = \frac{m^2}{n_am + \sum_{i=1}^m n_i}$$ and obtain an upper bound.

All the other terms are calculating harmonic means, too, so you could obtain an upper bound on your probability by using simple bounds on the harmonic mean. For the sums being subtracted, you could use an arithmetic mean, and for those being added, you could use the minimum. For more on bounds on the harmonic mean, see the power means inequality.

(Original comments on bounds.)

The difficulty here is that finding good bounds for general inclusion-exclusion problems is an ongoing research area.

The most basic bounds for inclusion-exclusion are the Bonferroni inequalities. Basically, they say that truncating the sum after any term gives either an upper bound or a lower bound, depending on whether the last term was added or subtracted. Thus you could get an upper bound by truncating after each odd term.

Otherwise, you can try a search on "bounding inclusion-exclusion"; perhaps that will turn up something that can be applied to your problem.


I have no idea if this bound will be of any use to you or not (or even how it compares to the inclusion-exclusion bound already posted), but you can get a quick(er) upper bound as follows. Without loss of generality let's assume the coupon we care about is coupon $k$ and that $n_1 \leq n_2 \leq \dots \leq n_{k-1}$. We can view the event that coupon $k$ is last as the intersection of $k-1$ different events.

$E_1$: The first coupon selected is not of type $k$.

$E_2$: The first coupon selected which does not match the first type of coupon drawn is not of type $k$.

...

$E_{n-1}$ the first coupon drawn does not match any of the first $n-2$ types of coupons drawn is not of type $k$.

The probability of $E_1$ is just $(1-\frac{n_k}{n})$. $P(E_2)$ is more complicated, but we know an expression for the probability of $E_2$ given that the first coupon drawn is of type $j$: $(1-\frac{n_k}{n-n_j})$, which is largest for $j=1$. It follows by averaging over all $j$ that $$P(E_2 | E_1) \leq 1-\frac{n_k}{n-n_1}$$ In general, we have $$P(E_j| E_1, E_2, \dots, E_{j-1}) \leq 1-\frac{n_k}{n-n_1-n_2-\dots-n_{j-1}}$$ And you can multiply this bound over all $1 \leq j \leq n-1$ to get a bound on the probability that $k$ is last: $$P(k \textrm{ last }) \leq \prod_{i=1}^{j-1} \left(1-\frac{n_k}{n-S_{k-1}}\right)$$ where $S_k=n_1+\dots+n_k$.

This bound is exact when all of the $n_j$ are equal, but in general it gets worse the further apart the $n_j$ are.