Urn with an infinite number of balls of finite, uniformly distributed colours [duplicate]

Without a proper prior on the number of colors in the urn, one has to rely on some ad-hoc recipe.

The maximum likelihood estimate is, I think, always the number of distinct colors actually seen.

I would look at the taper to 0 of the histogram of number of times each color is seen. If the rarest color was seen 100 times, I'd guess you've seen them all, but if the rarest colors occurred 3, 1, and 1 times apiece, I'd bet there were a few more colors to be seen, and would concoct a formula based on the shape of the taper.


Note: This is not a complete answer, but in depth explores some special cases and then puts up a conjecture for the general case.

First, an infinite number of balls in the urn only makes the question of what and what is not well-defined quite complicated. Since you are assuming equal probabilities for all colours, I assume the "infinite balls" is just to ensure that the probabilities don't change after drawing one. But that can be achieved by just putting the drawn ball back (and mixing the urn) before drawing again (usually called draw with replacement).

So I'll instead look at the following problem which should be equivalent to the one you meant (it is certainly equivalent to my interpretation of your description):

Given an Urn containing an unknown positive number of balls, all of which have a different colour, assume that on drawing $m$ balls we get $m_1$ balls of colour $c_1$, $m_2$ balls of colour $c_2$ and so on. What is the probability of the urn containing exactly $n$ balls?

The best way to approach this way is probably to first assume that there is a large, but finite upper bound $N$ for the possible number of colours (so we don't run into problems with infinities), and then consider the limit $N\to\infty$.

Now the problem as is is under-specified, as you don't say anything about the prior probability of the number of colours (that is, the probabilities to assign when $m=0$). It might be that it is rather unlikely that there are too many colours, because whoever made those balls would have had to pay more to make more balls. Or it might be that it is unlikely that there are very few colours because the maker wants to show off his paint-making abilities.

But given that we on't have any information about it, let's just take as prior probability that $n$ is equally distributed between $1$ and $N$ (that is, before any draws, each possible value of $n$ has the same probability $1/N$).

Thus before the first draw, the expected number of colours is $\langle n\rangle = N/2$ and thus diverges as $N\to\infty$. Note also that $P(n)\to 0$ for each individual $n$.

I'm also assuming the you know nothing about the order of colours (that is, if $c_1$ is green, and $c_2$ is orange, and an orange ball is drawn, you cannot conclude that there also is a green ball because when $c_2$ has been drawn, $c_1$ must exist, too.

Thus the first draw tells you absolutely nothing about the number of colours. All it tells you is that one of the colours is the colour you see. So you still have all possible numbers to be of equal probability, and $\langle n\rangle = N/2 \to \infty$ while $P(n)\to 0$ for all $n$.

Now the second draw is where it starts to get interesting, as here we can have two possible results: Either you draw the same colour again, or you draw another colour.

If there are $n$ colours in total, drawing the same colour again has probability $1/n$. Therefore we can use the law of Bayes to update our probablities to $$P(n|m_1=2) = \frac{1/n}{\sum_{j=1}^{N}{1/j}}$$ Now the sum in the denominator diverges for $N\to\infty$, and thus we still have $P(n)\to 0$ for all $n$. The expectation value then is easily calculated to be $$\langle n\rangle = \frac{N}{\sum_{j=1}^{N}{1/j}}$$ where the numerator grows linearly while the denominator only grows logarithmic in $N$, this for $N\to\infty$ we still get $\langle n\rangle \to \infty$.

On the other hand, if on the second draw you get another colour, all you've done is to exclude the possibility of there just being one colour, but all remaining options remain equally probable, so you'll have $$P(n|m_1=1,m_2=1) = \cases{ 0 & if $n=1$\\ \frac{1}{N-1} & otherwise }$$ It is quite obvious that also in this case, $P(n)\to 0$ and $\langle n\rangle\to\infty$ for $N\to\infty$.

Now let's look at the third draw. Let's first consider the case that on the first two draws we've got the same ball, and not the third draw again gives the same colour. In this case our prior probability for the third draw is $P(n|m_1=2)$ from above. Note that the denominator is independent of $n$ and thus cancels out in Bayes rule. So again taking the fact that the probability with $n$ balls to draw the same colour again is $1/n$, we get for the new probability: $$P(n|m_1=3) = \frac{1/n^2}{\sum_{j=1}^{N}{1/j^2}}$$ Now we get something new: The denominator converges for $N\to\infty$, as it is the series for $\zeta(2) = \pi/6$. Therefore the probabilities no longer vanish in the large $N$ limit, but rather you get $$\lim_{N\to\infty} P(n|m_1=3) = \frac{6}{\pi n^2}$$ For example, after drawing the same colour for three times, the probability of that being the only colour is already $\pi/6 \approx 0.52$. However the expectation value $\langle n\rangle$ still diverges.

We can see that this is easily extendible for an arbitrary number of draws: If you get always the same colour for $m_1=m$ draws, then the probability of having $n$ colours total is in the limit of large $N$ $$P(n|m_1=m) = \frac{1}{n^{m-1}\zeta(m-1)}$$ and the expectation value is $$\langle n\rangle = \frac{\zeta(m-2)}{\zeta(m-1)}$$ which for $m\ge 4$ also is finite. Indeed, already for $m=4$ we get $\langle n\rangle \approx 1.37$.

Going back to the case of three draws, the next case is where two of the draws were equal, but the third was different. As the order of draws should not matter, we can as well look at the case where the first two draws were different, but the third did give one of the two (it doesn't matter which one). This case is a bit easier to handle, as the prior probability for the third draw is still uniform on $\{2,\ldots,N\}$, with the case $n=1$ simply excluded. The probability to draw one of the two colours previously drawn is, of course, $2/n$ (there are $2$ colours out of $n$), so the probability distribution is $$P(n|m_1=2,m_2=1) = \cases{ 0 & for $n=1$\\ \frac{2/n}{\sum_{j=2}^N 2/j} & otherwise}$$ Note that you can actually cancel out the $2$ in the numerator.

For $N\to\infty$ again the denominator diverges; indeed, after cancelling out the $2$ it is just the zeta function with the first term missing.

From a conceptual point of view, it seems obvious that the only thing that should matter for the probability distribution is whether the newly drawn value is one that was already drawn previously, or a new one. Also, the order should obviously not matter.

Therefore generalizing the results above, I conjecture the following general formula:

Conjecture: If in the first $m$ draws $k$ different colours were drawn, the probability distribution is $$P(n|m,k) = \cases{ 0 & if $n<k$\\ \frac{1}{n^{k-1}\sum_{j=k}^N 1/j^{k-1}} & otherwise}$$ and the corresponding expectation value is $$\langle n\rangle = \frac{\sum_{j=k}^N 1/j^{k-2}}{\sum_{j=k}^N 1/j^{k-1}}$$ Introducing the "truncated zeta function" $$\zeta_k(m) = \sum_{j=k}^{\infty} \frac{1}{j^m} = \zeta(m) - \sum_{j=1}^{k-1} \frac{1}{j^m}$$ the limiting case can then be written as $$P(n|m,k) = \cases{ 0 & for $n<k$\\ \frac{1}{n^{k-1}\zeta_k(m-1)} & otherwise}$$ and $$\langle n\rangle = \frac{\zeta_k(m-2)}{\zeta_k(m-1)}$$