Coupon Problem generalized, or Birthday problem backward.

I want to solve a variation on the Coupon Collector Problem, or (alternately) a slight variant on the standard Birthday Problem.

In the standard Coupon Collector problem, someone chooses coupons at random (with replacement) from n different possible coupons. Duplicate coupons do us no good; we need a complete set. The standard question is "What is the expected number of coupons (or the probability distribution of the number of coupons) needed to collect them all?"

In the standard birthday problem, we choose k items from n options with replacement (such as k people in a room, each with one of 365 possible birthdays) and try to determine the probability distribution of the number of unique values (in particular, will two of them share a birthday?).

In my problem, someone has chosen k items from n options with replacement, and I know there were p distinct values, but I don't know what k was. If p=n this is the coupon problem, but I want to allow for values of p that are less than n. I want to determine the probability distribution for k as a function of p and n (actually, all I need is the expected value of k, but the distribution would be interesting as well).
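For concreteness, here is a small Monte Carlo sketch of the forward process (k draws producing p distinct values); the function names are my own, not standard:

```python
import random

def distinct_count(k, n, rng=random):
    """Draw k items uniformly with replacement from n options
    and return the number of distinct values observed."""
    return len({rng.randrange(n) for _ in range(k)})

def mean_distinct(k, n, trials=10_000, seed=0):
    """Monte Carlo estimate of E[p] for given k and n."""
    rng = random.Random(seed)
    return sum(distinct_count(k, n, rng) for _ in range(trials)) / trials
```

By linearity of expectation the exact forward value is E[p] = n(1 - (1 - 1/n)^k), which the estimate should approach as the number of trials grows; my question asks for the reverse direction, recovering k from an observed p.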


To do this properly, you need some sort of underlying distribution for $k$. For example, you could use Bayesian methods: start with a prior distribution for $k$, multiply it by the likelihood of seeing $p$ distinct values for each $k$ given $n$, and divide by a normalizing constant to get a posterior probability distribution for $k$. You might then take the mean, median, or mode of that posterior distribution as your central estimate; with an improper uniform prior, taking the mode would give the maximum likelihood estimate.
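As an illustrative sketch of this recipe (the function names are mine): the likelihood of observing exactly $p$ distinct values in $k$ draws from $n$ options is $\binom{n}{p}\,p!\,S(k,p)/n^k$, where $S(k,p)$ is a Stirling number of the second kind, and a uniform-prior posterior over a chosen range of $k$ can be computed directly:

```python
from math import comb, factorial

def stirling2(k, p):
    """Stirling number of the second kind, via the recurrence
    S(k, p) = p * S(k-1, p) + S(k-1, p-1)."""
    table = [[0] * (p + 1) for _ in range(k + 1)]
    table[0][0] = 1
    for i in range(1, k + 1):
        for j in range(1, min(i, p) + 1):
            table[i][j] = j * table[i - 1][j] + table[i - 1][j - 1]
    return table[k][p]

def likelihood(p, k, n):
    """P(exactly p distinct values | k uniform draws with replacement from n)."""
    if k < p:
        return 0.0
    return comb(n, p) * factorial(p) * stirling2(k, p) / n ** k

def posterior(p, n, k_max, prior=None):
    """Posterior over k in 1..k_max; uniform prior unless one is supplied."""
    ks = range(1, k_max + 1)
    weights = [(prior(k) if prior else 1.0) * likelihood(p, k, n) for k in ks]
    total = sum(weights)
    return {k: w / total for k, w in zip(ks, weights)}
```

Truncating at some `k_max` stands in for a proper prior; with an improper uniform prior over all $k \ge 1$ the posterior for small $p$ need not even be normalizable, which is why the choice of prior matters here.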

Another approach, based on the coupon collector's calculation, is to assume that the person was aiming to collect $p$ distinct items and stopped as soon as this was achieved. In that case the expected value of $k$ would be $$\sum_{j=n-p+1}^n \frac{n}{j} = n(H_n - H_{n-p}) \approx n \log_e \left(\frac{n}{n-p}\right),$$ where $H_n$ is the $n$th harmonic number, though the dispersion around this expectation would be relatively wide.
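Under that stopping-rule assumption, the exact sum and the logarithmic approximation are easy to compare numerically (a sketch; the function names are mine):

```python
import math

def expected_draws_stopping_at(p, n):
    """E[k] when drawing until p distinct items are seen:
    sum_{j=n-p+1}^{n} n/j = n * (H_n - H_{n-p})."""
    return sum(n / j for j in range(n - p + 1, n + 1))

def log_approximation(p, n):
    """n * log(n / (n - p)); requires p < n."""
    return n * math.log(n / (n - p))
```

The approximation is good when $n - p$ is not too small; as $p \to n$ the logarithm diverges while the exact sum stays finite at $n H_n$, so near the full-collection regime only the exact sum should be used.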


This is a statistical problem, not a probabilistic one: you have observed data (the value $p$) and seek to infer a parameter of the underlying probabilistic process (the value $k$). The forward process from $k$ to $p$ is well understood, but the inverse is much more difficult. Such a parameter-estimation problem has no single definitive "solution": any estimate depends on the statistical criterion you adopt.

The Maximum Likelihood Estimator of $k$ would be $\hat k = p$. Other statistical criteria would lead to different estimates.
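This claim can be checked numerically (a sketch under the uniform-draw model; the function names are mine): compute $P(p \text{ distinct} \mid k)$ by a simple dynamic program over the number of distinct values seen so far, then take the argmax over $k$:

```python
def dist_distinct(k, n):
    """Distribution of the number of distinct values after k uniform
    draws with replacement from n options; probs[d] = P(d distinct)."""
    probs = [1.0] + [0.0] * n  # after 0 draws, 0 distinct values
    for _ in range(k):
        new = [0.0] * (n + 1)
        for d in range(n + 1):
            if probs[d] == 0.0:
                continue
            new[d] += probs[d] * d / n              # repeat an already-seen value
            if d < n:
                new[d + 1] += probs[d] * (n - d) / n  # draw a new value
        probs = new
    return probs

def mle_k(p, n, k_max):
    """Argmax over k in p..k_max of P(p distinct | k draws)."""
    best_k, best_l = None, -1.0
    for k in range(p, k_max + 1):
        l = dist_distinct(k, n)[p]
        if l > best_l:
            best_k, best_l = k, l
    return best_k
```

Intuitively, $k = p$ maximizes the likelihood because every extra draw beyond $p$ must land on an already-seen value, and each such event costs a factor less than one.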