How many books are in a library?
My cousin is at elementary school and every week is given a book by his teacher. He then reads it and returns it in time to get another one the next week. After a while we started noticing that he was getting books he had read before and this became gradually more common over time. Naturally, I started to wonder how one could estimate the total number of books in their library.
Say the true number of books in the library is $N$ and the teacher picks one uniformly at random (with replacement) to give to you each week. If at week $t$ you have received a book you have read before $x$ times, is there an unbiased estimator for the total number of books in the library and what is the variance of this estimator? Is there another biased estimator with lower variance?
In my cousin's case, in the first $30$ weeks he received a book he had received before $3$ times.
Solution 1:
The Good-Turing estimate is given by $$\hat M ={N \over {1-{N_1 \over K}}}$$ where $N$ is the number of different names observed, $N_1$ is the number of names seen once, and $K$ is the total number of observations. For your data, assuming 24 observations were unique and the 3 duplicates were all seen twice, this yields $$\hat M = {27 \over {1-{24 \over 30}}}= 135.$$ I don't know about the standard error of this estimate; look up Good-Turing and see what you can find.
Solution 2:
Given there are $N$ books in total and $t$ weeks, the probability of having received $x$ duplicates (i.e. $t-x$ unique books) is $$S_2(t,t-x) \frac{ N!}{(N-(t-x))! N^t}$$ where $S_2$ denotes a Stirling number of the second kind. In your example $S_2(30,27)=10359090$.
The maximum likelihood estimate of $N$ given $t=30$ and $x=3$ is $135$. If there really were $N=135$ books then the expected number of duplicates would be about $3.01017$, while if there were $N=136$ books then the expected number of duplicates would be about $2.98951$. The value for $N$ which would give exactly $3$ is about $135.4907$ though of course this is not an integer.
So $135$ is a reasonable estimate. But there is still considerable uncertainty: any value for $N$ from $56$ to $601$ would have a likelihood for $3$ duplicates out of $30$ more than a tenth of the likelihood resulting from $N=135$. The variance for the number of duplicates if $N=135$ is about $2.26258$.
Solution 3:
The book picked on week $t$ was never picked before if, independently and $t-1$ times, another book was picked. This happens with probability $p^{t-1}$, where $p=1-N^{-1}$. Thus, the mean number of weeks up to week $t$ when the book picked was not a new one is $$ W_t=\sum_{s=1}^t\left(1-p^{s-1}\right)=t-\frac{1-p^t}{1-p}, $$ that is, $$ t-W_t=N\cdot(1-(1-N^{-1})^t).\tag{$\ast$} $$ The RHS increases from $1$ when $N=1$ to $t$ when $N=\infty$ and the book picked on week $1$ is always a new one, hence $0\leqslant W_t\leqslant t-1$. For every observed $W_t$ in $\{0,1,\ldots,0,t-1\}$, there exists a unique $\hat N_t$ such that $(\ast)$ holds.
The exact value of $\hat N_t$ has no algebraic expression when $t$ is large but, in the limit $t\to\infty$ and $W_t/t\to\mathtt w$, one knows that $\hat N_t/t$ converges to the unique solution $\nu$ of the equation $$ \nu\cdot(1-\mathrm e^{-1/\nu})=1-\mathtt w.\tag{$\dagger$} $$ Numerically, if $t=30$ and $W_t=3$, $(\ast)$ yields $\hat N_t=135.49$ hence one would conclude that $N=135$ or $N=136$, and the approximation $(\dagger)$ with $\mathtt w=1/10$ yields $\nu t=139.8$, which is not too far off the mark even though $t=30$ is relatively small (note that $\nu t$ always overestimates $\hat N_t$, due to convexity).