Probability - Expected number of draws to get all 52 cards at least once drawing in groups of size n [duplicate]

Imagine you have a deck of cards and want to be fairly sure that you draw each card once (with a perfectly fair, complete, and random deck on each draw, of course). You are drawing cards in groups of size n from the deck. What is the expected number of draws such that each card has been drawn at least once?

Similar to the coupon collector's problem, but not quite. How would one go about integrating the math for that algorithm with drawing multiple cards at the same time?

Edit: found some duplicate questions.

How to calculate the expected value of the coupon collector problem if we are collecting the coupons in groups of k?

Coupon Collector Problem with Batched Selections


We can actually solve some special cases. Suppose we have $n$ types of coupons which are drawn in packets of $q$ coupons, with no duplicates in the packets. We derive closed forms for any packet size and evaluate them for $q=2.$ Now with $T_{m,n,q}$ the number of ways of drawing $m$ packets of $q$-subsets of $[n]$ so that all possible values of $n$ are present we get for the probability of $m$ draws the closed form

$$P[T=m] = {n\choose q}^{-m} \sum_{k=1}^q {n\choose n-k} \times T_{m-1, n-k, q} \times {n-k \choose q-k} \\ = {n\choose q}^{-m} \sum_{k=1}^q {n\choose k} \times T_{m-1, n-k, q} \times {n-k \choose q-k}.$$

This is for $n\gt q$ since the process always halts at the first step when $n=q.$ The sum variable $k$ is the count of the values missing before the draw of the subset at position $m$ or alternatively of the values that appear in draw $m$ for the first time.

To compute the terms in $T$ we have a simple version of the computation at the following MSE link and introduce the generating function

$$[z^q] \prod_{l=1}^n (1+zA_l)$$

which generates the $q$-subsets so that

$$\left([z^q] \prod_{l=1}^n (1+zA_l)\right)^m$$

generates the $m$-sequences of $q$-subsets. We use inclusion-exclusion to remove those terms where some of the $n$ terms are missing. The nodes $P\subseteq A$ in the poset represent terms from the generating function where the elements of $P$ are missing plus possibly some more. This is evidently accomplished by setting the $A_l\in P$ to zero. We set the remaining $A_l$ to one to obtain a count. The contribution for a given $P$ is

$$[z^q] (1+z)^{n-|P|} = {n-|P|\choose q}.$$

Therefore inclusion-exclusion yields

$$\sum_{p=0}^n {n\choose p} (-1)^p {n-p\choose q}^m$$

which is

$$\bbox[5px,border:2px solid #00A000]{ T_{m,n,q} = \sum_{p=0}^n {n\choose p} (-1)^{n-p} {p\choose q}^m.}$$

This is zero when $m=0$ and $n\ge 1.$ Now as a sanity check we should have $T_{m,n,1} = {m\brace n} \times n!$ and indeed we obtain

$$\sum_{p=0}^n {n\choose p} (-1)^{n-p} {p\choose 1}^m = \sum_{p=0}^n {n\choose p} (-1)^{n-p} p^m = {m\brace n} \times n!$$

and the check goes through. The rest is as shown at the following MSE link.

Next let us try to verify that we have a probability distribution. We have

$$\sum_{m\ge 1} P[T=m] = \sum_{k=1}^q {n\choose k} {n-k\choose q-k} \sum_{p=0}^{n-k} {n-k\choose p} (-1)^{n-k-p} \sum_{m\ge 1} {n\choose q}^{-m} {p\choose q}^{m-1} \\ = \sum_{k=1}^q {n\choose k} {n-k\choose q-k} {n\choose q}^{-1} \sum_{p=0}^{n-k} {n-k\choose p} (-1)^{n-k-p} \frac{1}{1-{p\choose q}/{n\choose q}} \\ = \sum_{k=1}^q {n\choose k} {n-k\choose q-k} \sum_{p=0}^{n-k} {n-k\choose p} (-1)^{n-k-p} \left({n\choose q}-{p\choose q}\right)^{-1}.$$

Specializing to $q=2$ we get

$$2\sum_{k=1}^2 {n\choose k} {n-k\choose 2-k} \sum_{p=0}^{n-k} {n-k\choose p} (-1)^{n-k-p} \frac{1}{n(n-1)-p(p-1)} \\ = 2n(n-1) \sum_{p=0}^{n-1} {n-1\choose p} (-1)^{n-1-p} \frac{1}{n(n-1)-p(p-1)} \\ + n(n-1) \sum_{p=0}^{n-2} {n-2\choose p} (-1)^{n-2-p} \frac{1}{n(n-1)-p(p-1)}.$$

We evaluate the two sums by residues, using for the first sum

$$f(z) = \frac{(n-1)!}{n(n-1)-z(z-1)} \prod_{p=0}^{n-1} \frac{1}{z-p} = -\frac{(n-1)!}{(z-n)(z-(1-n))} \prod_{p=0}^{n-1} \frac{1}{z-p}.$$

The residue at infinity is zero and the residues at $n$ and $1-n$ are

$$-\frac{(n-1)!}{2n-1} \frac{1}{n!} - \frac{(n-1)!}{1-2n} \frac{(-1)^n (n-2)!}{(2n-2)!}.$$

We get for the second sum by the same technique

$$-\frac{(n-2)!}{2n-1} \frac{1}{n!} - \frac{(n-2)!}{1-2n} \frac{(-1)^{n-1} (n-2)!}{(2n-3)!}.$$

Negate and add to get

$$n(n-1) \times \\ \left(\frac{1}{2n-1}\left(\frac{2}{n}+\frac{1}{n(n-1)}\right) + \frac{(n-2)!}{1-2n} \frac{(-1)^{n-1} (n-2)!}{(2n-3)!} \left(1-2\frac{n-1}{2n-2}\right)\right) \\ = \frac{1}{2n-1} (2n-2+1) = 1.$$

This confirms it being a probability distribution.

We now give a closed form for the expectation. We find

$$\sum_{m\ge 1} m P[T=m] \\ = \sum_{k=1}^q {n\choose k} {n-k\choose q-k} \sum_{p=0}^{n-k} {n-k\choose p} (-1)^{n-k-p} \sum_{m\ge 1} m {n\choose q}^{-m} {p\choose q}^{m-1} \\ = \sum_{k=1}^q {n\choose k} {n-k\choose q-k} {n\choose q}^{-1} \sum_{p=0}^{n-k} {n-k\choose p} (-1)^{n-k-p} \frac{1}{\left(1-{p\choose q}/{n\choose q}\right)^2}.$$

This is

$$\bbox[5px,border:2px solid #00A000]{ {n\choose q} \sum_{k=1}^q {n\choose k} {n-k\choose q-k} \sum_{p=0}^{n-k} {n-k\choose p} (-1)^{n-k-p} \left({n\choose q}-{p\choose q}\right)^{-2}.}$$

We obtain $nH_n$ when we evaluate this for $q=1$ which is a good check but not exactly surprising since we have already seen this work at the other link (the reader is invited to attempt this computation using the above formula as a starting point, which is easier than what follows). We now try for a closed form for $q=2$ and get

$$2n^2(n-1)^2 \sum_{p=0}^{n-1} {n-1\choose p} (-1)^{n-1-p} \frac{1}{(n(n-1)-p(p-1))^2} \\ + n^2(n-1)^2 \sum_{p=0}^{n-2} {n-2\choose p} (-1)^{n-2-p} \frac{1}{(n(n-1)-p(p-1))^2}.$$

We use residues as before with the function

$$g(z) = \frac{(n-1)!}{(n(n-1)-z(z-1))^2} \prod_{p=0}^{n-1} \frac{1}{z-p} = \frac{(n-1)!}{(z-n)^2(z-(1-n))^2} \prod_{p=0}^{n-1} \frac{1}{z-p}.$$

Note that

$$\left(\prod_{p=0}^{n-1}\frac{1}{z-p}\right)' = -\prod_{p=0}^{n-1}\frac{1}{z-p} \sum_{p=0}^{n-1} \frac{1}{z-p}$$

We get for the residue at $n$

$$(n-1)! \left(-\frac{1}{(2n-1)^2} \frac{1}{n!} H_n -\frac{2}{(2n-1)^3} \frac{1}{n!}\right)$$

The residue at $1-n$ yields

$$(n-1)! \left(\frac{1}{(1-2n)^2} \frac{(-1)^n(n-2)!}{(2n-2)!} (H_{2n-2} - H_{n-2}) \\ - \frac{2}{(1-2n)^3} \frac{(-1)^n(n-2)!}{(2n-2)!} \right)$$

For the second sum we get for the residue at $n$

$$(n-2)! \left(-\frac{1}{(2n-1)^2} \frac{1}{n!} (H_n-1) -\frac{2}{(2n-1)^3} \frac{1}{n!}\right)$$

and for the one at $1-n$

$$(n-2)! \left(\frac{1}{(1-2n)^2} \frac{(-1)^{n-1}(n-2)!}{(2n-3)!} (H_{2n-3} - H_{n-2}) \\ - \frac{2}{(1-2n)^3} \frac{(-1)^{n-1}(n-2)!}{(2n-3)!} \right)$$

Collecting everything we find (observe that the terms on $H_{2n-2}$ and on $H_{n-2}$ and the third term from the residues at $1-n$ cancel the same way as in the computation of the probability that we saw earlier)

$$-\frac{1}{(2n-1)^2} H_n \left(\frac{2}{n} + \frac{1}{n(n-1)}\right) + \frac{1}{(2n-1)^2} \frac{1}{n(n-1)} \\- \frac{2}{(2n-1)^3} \left(\frac{2}{n}+\frac{1}{n(n-1)}\right) \\ + (n-1)! \frac{2}{(1-2n)^2} \frac{(-1)^n(n-2)!}{(2n-2)!} \frac{1}{2n-2}$$

This is

$$-\frac{1}{(2n-1)n(n-1)} H_n - \frac{1}{(2n-1)^2} \frac{1}{n(n-1)} \\ + (n-2)! \frac{1}{(1-2n)^2} \frac{(-1)^n(n-2)!}{(2n-2)!}$$

Flip the sign and multiply by $n^2(n-1)^2$ to obtain the formula

$$\frac{n(n-1)}{2n-1} H_n + \frac{n(n-1)}{(2n-1)^2} + (-1)^{n-1} \frac{n^2(n-1)^2}{(2n-1)^2} \frac{(n-2)!\times(n-2)!}{(2n-2)!}$$

We conclude with the closed form (an exact result) for the case of packets containing two coupons which is

$$\bbox[5px,border:2px solid #00A000]{ \frac{n(n-1)}{2n-1} H_n + \frac{n(n-1)}{(2n-1)^2} + (-1)^{n-1} \frac{n^2}{(2n-1)^2} {2n-2\choose n-1}^{-1}.}$$

This attractive formula obviously motivates further research, possibly into the case $q=3,$ which looks difficult, perhaps requiring a computer algebra system during the simplifications. Observe that the dominant asymptotic is $1/2\times n H_n$ which means $q=2$ is about twice as fast as a single coupon so the effect of there being no packets with duplicate coupons is negligible.

Here is some Maple code to help with further research into these statistics. Not optimized to the limit e.g. the sets could be replaced with hash tables, but sufficiently functional to verify all of the above.

with(combinat);

ENUM :=
proc(m, n, q)
    local ind, setct, d, data, seen, pos, res;

    setct := binomial(n, q);
    data := choose({seq(p, p=1..n)}, q);
    res := 0;

    for ind from setct^m to 2*setct^m-1 do
        d := convert(ind, base, setct);

        seen :=
        `union`(seq(data[d[pos]+1], pos=1..m));

        if nops(seen) = n then
            res := res + 1;
        fi;
    od;

    res;
end;

T := (m, n, q) ->
add(binomial(n,p)*(-1)^p*binomial(n-p,q)^m, p=0..n);


PM :=
proc(m, n, q)
    local ind, setct, d, data, seen, pos, res;

    setct := binomial(n, q);
    data := choose({seq(p, p=1..n)}, q);
    res := 0;

    for ind from setct^m to 2*setct^m-1 do
        d := convert(ind, base, setct);

        seen :=
        `union`(seq(data[d[pos]+1], pos=1..m-1));

        if nops(seen) < n and
        nops(seen union data[d[m]+1]) = n
        then
            res := res + 1;
        fi;
    od;

    res;
end;

PMX := (m, n, q) ->
add(binomial(n,k)*T(m-1,n-k,q)*binomial(n-k,q-k),
    k=1..q);

STATAPPROX :=
proc(n, q, f, mx)
    option remember;
    local res, m, setct;

    res := 0; setct := binomial(n, q);

    for m to mx do
        res := res +
        f(m)*PMX(m, n, q)/setct^m;
    od;

    res;
end;

EXPT := (n, q) ->
binomial(n, q)
*add(binomial(n,k)*binomial(n-k,q-k)
     *add(binomial(n-k,p)*(-1)^(n-k-p)
          *(binomial(n,q)-binomial(p,q))^(-2),
          p=0..n-k), k=1..q);

EXPT2 := n ->
n*(n-1)*harmonic(n)/(2*n-1) + n*(n-1)/(2*n-1)^2
+ (-1)^(n-1)*n^2/(2*n-1)^2*binomial(2*n-2,n-1)^(-1);

The following simple C program can be used to verify the closed form of the expectation by numerical simulation. These were in excellent agreement to several significant figures on all values that were examined. Compiled with the std=gnu99 option.

#include <stdlib.h>
#include <stdio.h>
#include <assert.h>
#include <time.h>
#include <string.h>

long choose(long n, long k)
{
  long num = 1, denom = 1;

  while(k > 0){
    num *= n;
    denom *= k;

    n--; k--;
  }

  return num/denom;
}

void recurse(int n, int q, int *cdata, int *cposrf,
             int nxt, int *sofar, int sf)
{
  if(sf == q){
    memcpy(cdata + q*(*cposrf), sofar, q*sizeof(int));
    (*cposrf)++;

    return;
  }

  if(nxt >= n) return;

  recurse(n, q, cdata, cposrf, nxt+1, sofar, sf);

  sofar[sf++] = nxt;
  recurse(n, q, cdata, cposrf, nxt+1, sofar, sf);
  sf--;
}


int main(int argc, char **argv)
{
  int n = 6 , q = 3, trials = 1000; 

  if(argc >= 2){
    n = atoi(argv[1]);
  }

  if(argc >= 3){
    q = atoi(argv[2]);
  }

  if(argc >= 4){
    trials = atoi(argv[3]);
  }

  assert(1 <= n);
  assert(1 <= q && q < n);
  assert(1 <= trials);

  int allpackets = (int)choose(n, q);
  int cdata[allpackets*q];
  int choice[q], cpos = 0;

  recurse(n, q, cdata, &cpos, 0, choice, 0);

  int pidx;
  for(pidx = 0; pidx < allpackets; pidx++){
    int offs = pidx*q;

    printf("%d", cdata[offs]);
    for(int elidx = 1; elidx < q; elidx++)
      printf(", %d", cdata[offs+elidx]);

    printf("\n");
  }

  srand48(time(NULL));
  long long data = 0;

  for(int tind = 0; tind < trials; tind++){
    int seen = 0; int steps = 0; 
    int dist[n];

    for(int cind = 0; cind < n; cind++){
      dist[cind] = 0;
    }

    while(seen < n){
      int packet = drand48() * (double)allpackets;
      int offs = packet * q;

      steps++;

      for(pidx = 0; pidx < q; pidx++){
        int coupon = cdata[offs + pidx];

        if(dist[coupon] == 0)
          seen++;
        dist[coupon]++;
      }
    }

    data += steps;
  }

  long double expt = (long double)data/(long double)trials;
  printf("[n = %d, q = %d, trials = %d]: %Le\n", 
         n, q, trials, expt);

  exit(0);
}

This is not a solution to the asked question, it is just an approximation

Let a deck with $M$ distinct cards, and each time you draw $n$ cards randomly, put inside again and draw again, etc... We will suppose that the probability to draw some card is the same for all cards, that is, $p=1/M$.

We will work this problem as a Markov chain: suppose you had drawn $k$ distinct cards (no matter in how many draws, ignore this) and you want to know the probability that drawing the next $n$ cards the state of distinct cards will change from $k$ to $k+j$, where $j\in\{0,\ldots,n\}$.

Then if we draw $n$ cards and all are repeated, we have that

$$\Pr[k\to k]=\frac{k}{M}\cdot\frac{k-1}{M-1}\cdots\frac{k-n+1}{M-n+1}=\frac{k^\underline n}{M^\underline{n}}$$

and in general

$$\Pr[k\to k+j]=\binom{n}{j}\frac{k^\underline{n-j}(M-k)^\underline{j}}{M^\underline n}$$

Then the expected change of $k$ from a draw is

$$\mathrm E[\text{change}]=\sum_{j=0}^n j\Pr[k\to k+j]=\frac1{M^\underline n}\sum_{j=0}^n j\binom{n}{j}k^\underline{n-j}(M-k)^\underline{j}\tag{1}$$

The last summation involves a well-known Chu-Vandermonde identity:

$$\sum_{k=0}^n \binom{n}{k}a^\underline k b^\underline{n-k}=(a+b)^\underline n$$

Then using some algebra in (1) we have that

$$\mathrm E[\text{change}]=\frac{n(M-k)}{M^\underline n}\sum_{j=0}^n \binom{n-1}{j-1}(k-1)^\underline{n-j}(M-k)^\underline{j-1}=\frac{n(M-k)}{M^\underline n}(M-1)^\underline{n-1}=n\frac{M-k}M=n\left(1-\frac{k}M\right)$$

The above means that from some draw the expected number of new cards is $n(M-k)/M$ (observe that this quantity is well-defined only when $0\le k\le M$), then (if Im not wrong, what is not sure) the expected number of different cards after $\ell$ draws is the recurrent sum

$$T_\ell:=\sum_{h=1}^\ell N_h\tag{2}$$

where $N_h:=n\left(1-\frac{\sum_{j=1}^{h-1}N_j}M\right) $ and $N_1=n$. I dont know if (2) have a closed form, anyway with different values of $\ell$ you can get an approximation for the minimal number of throws such that $T\ge M$.

EDIT:

It seems that (2) is closable, using some mathematical software I get the solution:

$$T_\ell=(6M\ell+\ell-\ell^3)\frac{n}{6M}$$

But this function for $M=52$ and $n=5$ never gets bigger than $\approx 34$ (this happen when $\ell=10$), so something is very wrong in my interpretation/calculation of the approximation. Probably the fastest way to approximate the expected number of draws is through some numerical modeling software as R.


In the wikipedia article about the Coupon collector's problem is stated that

Wolfgang Stadje has solved the case where the stickers are bought in packets, which contain no duplicates.[3] The results show that for practical applications, say packets of five stickers, the effect of packets is negligible.

Then this problem is practically the same than the original coupon problem.


From page 18 of this document there is an analysis for this case.