Guessing the number of colors of beads in an urn
Motivation from a cocktail bar
Every time I order the cocktail “Latex and Prejudice” (“Латекс и предубеждение”) in the Tesla bar in Saint Petersburg (Russia), the barkeeper selects a small interesting photo$^1$ at random and attaches it with a clamp to the cocktail glass. At the beginning I always got different pictures and started to collect them. The more cocktails I ordered, the more often the motifs repeated. Finally, I had drunk so much that almost every time I had to ask for a different photo. I wondered how many different pictures there are, but due to drunkenness I couldn't solve the problem by myself.
Mathematical form using urn model
Consider an urn with an unknown number of balls of an unknown number of colors. It is assumed that every color is equally probable. From this urn a total of $n$ balls with $m$ different colors were drawn, with $k_i$ balls of color $i$, i.e. $n=\sum_{i=1}^m k_i$. How many different colors $M$ are in the urn? Sampling with and without replacement are both of interest.
Questions
Some answers have already been given. Now I am looking for alternative answers and/or answers to the following more specific questions:
- For the first question only, let's assume that we do not know whether the balls were drawn with or without replacement. Is the maximum likelihood for drawing with replacement always higher than the maximum likelihood for drawing without replacement?
- Is this answer helpful? If yes: can we treat the calculated likelihoods as discrete probability distributions once they are normalized (with support on the integers $\ge m$)? What can we say about their variance and standard error?
- What can we say about the variance and standard error of another answer?
Related problems
In this SE post the number of colors in the urn is also unknown, but in the problem given here it is additionally assumed that the colors are equally probable. Another SE post deals with borrowing books from a library that had already been borrowed at an earlier time.
$\small{^1\text{Because the site is accessible to minors, the content of the photos is not discussed here.}}$
Drawing without replacement
Let $n$ be the number of balls that were sampled, with $m$ distinct colors and color frequencies $k_i$ for $i=1,\ldots,m$. We call $M$ the number of colors in the urn and $K$ the number of balls per color. $M$ and $K$ are not known, apart from the facts that $M\ge m$ and $K\ge \max_i(k_i)$. The likelihood of obtaining the observed color counts $k_i$, if we assume values for $M$ and $K$, is given by the multivariate hypergeometric distribution $$\frac{\prod_{i=1}^{m}\binom{K}{k_i}}{\binom{K\cdot M}{n}}.\tag{1}$$
As the order of the color frequencies $k_i$ does not matter for our problem, we have to multiply eq. $(1)$ by the number of their multiset permutations $$\frac{M!}{(M-m)!\prod_{j=1}^l z_j!},\tag{2}$$ where $z_j$ are the multiplicities of the elements in $k$ and $l$ is the number of distinct elements in $k$. The factor $(M-m)!$ in eq. $(2)$ accounts for the multiplicity of $0$, i.e. all colors that are contained in the urn but were not drawn. (Note that $k_i \ge 1$, because we only count the colors that were actually drawn.) Multiplying eq. $(1)$ and eq. $(2)$ gives the likelihood of observing the color counts $k_i$ for assumed values of $M$ and $K$: $$p_1(M,K)=\frac{\prod_{i=1}^{m}\binom{K}{k_i}}{\binom{K\cdot M}{n}}\,\frac{M!}{(M-m)!\prod_{j=1}^l z_j!}.\tag{3}$$
Answer: The best guess for the number of colors in the urn is given by that value of $M$ that maximizes $p_1$ for an assumed $K$.
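As a minimal numerical sketch, eq. $(3)$ can be evaluated with exact integer arithmetic; the function name `p1` and the choice of Python's `math.comb`/`math.factorial` are mine, not part of the derivation:

```python
from math import comb, factorial
from collections import Counter

def p1(M, K, k):
    """Eq. (3): likelihood of the observed color counts k when drawing
    n = sum(k) balls without replacement from an urn with M colors
    and K balls per color."""
    n, m = sum(k), len(k)
    if M < m or K < max(k) or K * M < n:
        return 0.0
    hyper = 1
    for ki in k:                              # numerator of eq. (1)
        hyper *= comb(K, ki)
    mult = factorial(M) // factorial(M - m)   # M!/(M-m)! from eq. (2)
    for z in Counter(k).values():             # divide by the multiplicities z_j!
        mult //= factorial(z)
    return hyper * mult / comb(K * M, n)      # eq. (1) times eq. (2)
```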
Drawing with replacement
This corresponds to the limit $K\to \infty$ in eq. $(3)$: the likelihood for an assumed $M$ converges to a multinomial distribution, which, as before, is multiplied by eq. $(2)$: $$p_2(M)=\frac{n!}{\prod_{i=1}^m k_i!}\left(\frac{1}{M}\right)^n \frac{M!}{(M-m)!\prod_{j=1}^l z_j!}.\tag{4}$$ Answer: The best guess for the number of colors in the urn is given by that value of $M$ that maximizes $p_2$.
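A corresponding sketch of eq. $(4)$, under the same illustrative conventions:

```python
from math import factorial
from collections import Counter

def p2(M, k):
    """Eq. (4): likelihood of the observed color counts k when drawing
    with replacement (the limit K -> infinity) from an urn with M colors."""
    n, m = sum(k), len(k)
    if M < m:
        return 0.0
    multinom = factorial(n)
    for ki in k:                              # multinomial coefficient n!/(k_1!...k_m!)
        multinom //= factorial(ki)
    mult = factorial(M) // factorial(M - m)   # eq. (2)
    for z in Counter(k).values():
        mult //= factorial(z)
    return multinom * mult / M**n
```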
Example
Let's assume the frequencies of the colors of the drawn balls are $k_i=(1, 2, 1, 1, 1, 2, 1, 1, 3, 1, 1)$. It follows that $n=15$, $m=11$, $K\ge 3$, $M\ge 11$. The multiplicities of the frequencies in $k$ are $z_j=(1,2,8)$ for $l=3$ distinct elements in $k$. Plotting the likelihoods shows that for drawing with replacement ($K=\infty$, black curve) the most likely number of different colors in the urn is $21$. For drawing without replacement, assuming that there are no more balls per color than observed in the sample ($K=3$, red curve), the most likely number is $16$ different colors. Assuming that the urn contains $K=6$ balls per color, we get $18$ colors (blue curve). For larger $K$ the black curve is approached.
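Reusing the `p1` and `p2` sketches from above, a brute-force maximization over $M$ reproduces these values (the search cap of $200$ is arbitrary; the likelihoods decay well before that):

```python
k = [1, 2, 1, 1, 1, 2, 1, 1, 3, 1, 1]   # observed color frequencies: n = 15, m = 11
Ms = range(len(k), 200)                  # candidate values M >= m

print(max(Ms, key=lambda M: p1(M, 3, k)))   # 16  (without replacement, K = 3)
print(max(Ms, key=lambda M: p1(M, 6, k)))   # 18  (without replacement, K = 6)
print(max(Ms, key=lambda M: p2(M, k)))      # 21  (with replacement, K -> infinity)
```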
A Bayesian view

Let $n$ be the number of balls sampled, $m \leq n$ the number of distinct colors seen, and $k_c$ the frequency of color $c$.
When answering questions of this form ("I drew from an unknown distribution; after seeing the observations, what is my guess of the distribution?"), you can start by asking yourself: if there were $M \geq m$ colors, how likely would it be that you saw the draw that you did?
You could work out some answer to that. It would behave the way you expect: if the $k_c$ are on average high (you saw many repeats), the observed draw is unlikely under a large $M$; if the $k_c$ are low, the opposite. Call this quantity $f(M; n, m, \{k_c\})$.
But how do you put this together? Where does this bring you? Can you answer a question such as "what's the expected number of distinct colors"?
It turns out we can't just yet. Why not? There are infinitely many values $M \geq m$, and it's not so simple to just take an "average" over infinitely many terms. In terms of Bayesian probability, you first need a prior: before seeing the results of any draws, what was your guess of the (distribution of the) number of distinct colors in the bin?
Why do we need this? You might object that, if there were only finitely many possibilities, we could just happily take an average and move on. But that is the same as what we just discussed: you would simply have chosen the prior under which all of the finitely many outcomes are equally likely. You'll have to do the same here.
So, what are your priors? Well, I've never been to this bar, but as a silly upper bound I'd be pretty certain they have fewer than 1,000,000 different photos. It's probably fewer than 100, too. I'd guess it's more likely to be 15 than to be just 3 (and after you see your fourth distinct picture, you update your prior such that the probability that there are only three pictures is now zero!).
If your prior beforehand was some function $p(M)$, where $\sum_{M=1}^{\infty} p(M) = 1$, then we can finally come up with what your posterior -- your updated view of the world -- is given your observation $(n, m, \{k_c\})$.
So, what's your updated probability that there are $M$ distinct colors? Naively it would be $p(M)\, f(M; n, m, \{k_c\})$ (your prior multiplied by the likelihood of the observation), but you also have to normalize and divide by $\sum_M p(M)\, f(M; n, m, \{k_c\})$, i.e. this quantity summed over all $M$. Calling your posterior $p'(M)$, you'd find that your guess of the number of distinct colors is simply
$$\sum_{M=1}^{\infty} M p'(M).$$
Of course, the sum of $p'(M)$ over all $M$ is $1$ by construction, so the above summation is really just the expected value of a random variable.
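For concreteness, here is a small sketch of this procedure. The specific choices are mine, not part of this answer: $f$ is taken to be the with-replacement likelihood (eq. $(4)$ from the earlier answer), the prior is uniform on $1,\ldots,M_{\max}$ with an arbitrary cap $M_{\max}=1000$, and the counts are the example from above:

```python
from math import factorial
from collections import Counter

def f(M, k):
    """Assumed likelihood f(M; n, m, {k_c}): here the with-replacement form, eq. (4)."""
    n, m = sum(k), len(k)
    if M < m:
        return 0.0
    like = factorial(n)
    for kc in k:                       # multinomial coefficient n!/(k_1!...k_m!)
        like //= factorial(kc)
    perm = 1
    for j in range(m):                 # falling factorial M (M-1) ... (M-m+1)
        perm *= M - j
    for z in Counter(k).values():      # divide by multiplicities of repeated counts
        perm //= factorial(z)
    return like * perm / M**n

k = [1, 2, 1, 1, 1, 2, 1, 1, 3, 1, 1]                  # example counts from above
M_MAX = 1000                                           # arbitrary truncation of the prior
prior = {M: 1.0 / M_MAX for M in range(1, M_MAX + 1)}  # uniform prior p(M)
post = {M: prior[M] * f(M, k) for M in prior}          # unnormalized posterior
norm = sum(post.values())
post = {M: v / norm for M, v in post.items()}          # p'(M)
print(sum(M * q for M, q in post.items()))             # posterior mean number of colors
```

Swapping in a different prior (for instance one concentrated below 100, as suggested above) only requires changing the `prior` dictionary.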
Apologies for the somewhat nebulous answer here, but hopefully this gives you a place to start for how one might think about questions like these.