Guessing the length of a playlist on "shuffle random"?

The other night I was hanging out with some friends and someone put on a playlist on shuffle random, where songs are drawn uniformly at random from a fixed playlist. The person who put the playlist together forgot how many songs were in it, so the topic came up of how to estimate the size of the playlist purely from what we were hearing.

We came up with a few high-level ideas for how to do this. For example, using ideas from the Birthday Paradox, we thought we could listen until we heard a song repeated for the first time, then use that to make an educated guess about how many songs were on the playlist in total. We also thought that we could listen for a long time and build up a frequency histogram of the number of times each song was played, then use the fact that it should look somewhat normally distributed to get the mean and variance and from there estimate the total number of songs on the list.

None of us are statisticians or have a lot of training in machine learning, but I suspect that this is probably a well-studied problem and that there are some really nice techniques we can use to estimate the playlist size.

Is there a good family of techniques for estimating the playlist size? From a practical perspective, would any of these techniques be relatively easy to work out without a computer or calculator?

Thanks!


From the Langford & Langford paper: suppose you have heard $m$ songs, of which $i$ were distinct, with $i<m$. The most likely size of the playlist (in the maximum likelihood sense) is $$\hat{N}=\left\lfloor \frac{1}{1-y^\star} \right\rfloor$$ where $\lfloor \cdot \rfloor$ denotes the floor function, and $y^\star$ denotes the smaller positive root of the polynomial $$y^m-iy+(i-1)=0.$$ (Note that $y=1$ is always a root of this polynomial; $y^\star$ is the other positive root, which lies strictly between $0$ and $1$.)

Note: if $i=m$, all the songs you've heard so far are distinct, and there is no good way to estimate the size of the playlist; it might be infinite for all we know.

For convenience, here is a table of values for $\hat{N}$, all for $m=10$. For example, if you listen to $m=10$ songs and hear $i=7$ different ones among them, then the most likely size of the playlist is $\hat{N}=12$.

\begin{array}{|r|r|}
\hline
i & \hat{N} \\ \hline
1 & 1 \\
2 & 2 \\
3 & 3 \\
4 & 4 \\
5 & 5 \\
6 & 8 \\
7 & 12 \\
8 & 19 \\
9 & 42 \\ \hline
\end{array}
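If you do have a computer handy, the root-finding is easy to automate. Here is a minimal Python sketch (the function name is mine, and I assume SciPy is available); it reproduces the table above:

```python
from math import floor
from scipy.optimize import brentq

def mle_playlist_size(m, i):
    """MLE of the playlist size after hearing m songs, i of them distinct (i < m).

    Finds the root y* of y^m - i*y + (i-1) = 0 strictly inside (0, 1)
    (y = 1 is always a root, so the bracket must stop short of 1),
    then returns floor(1 / (1 - y*)).
    """
    if i >= m:
        raise ValueError("need i < m; with no repeats there is no finite MLE")
    if i == 1:
        return 1  # every play was the same song
    f = lambda y: y**m - i * y + (i - 1)
    # f(0) = i - 1 > 0 and f(1 - eps) < 0 (since f(1) = 0 with f'(1) = m - i > 0),
    # so this sign change brackets the unique root in (0, 1).
    y_star = brentq(f, 1e-12, 1 - 1e-9)
    return floor(1 / (1 - y_star))

# Reproduce the m = 10 table above:
print([mle_playlist_size(10, i) for i in range(1, 10)])
# [1, 2, 3, 4, 5, 8, 12, 19, 42]
```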


As @dsaxton suggests, one method is 'capture-recapture' (also called 'mark-recapture'). See the Wikipedia article for some technical details (if you can deal with the needlessly confusing notation). Here is the basic method with an intuitive argument for it.

Method. Listen to $c = 20$ distinct songs, noting their titles. Then listen to another $r = 20$ songs (also distinct among themselves), and count the $x$ repeats between the two groups of songs.

Estimation. Let $N$ be the total number of songs on the playlist. The fraction $x/r$ is the proportion of songs in the second group that also appeared in the first. The fraction $c/N$ is the proportion of songs in the whole playlist that are in the first group. Intuitively, the two proportions ought to be about the same: $x/r \approx c/N$, which leads to $N \approx cr/x.$ So if you had $x = 5$ repeats, then you would estimate $N = 400/5 = 80$ songs on the playlist.

Obviously, this method does not work if you have $x = 0$ repeats. Because $x = 0$ occurs with positive probability, the estimator $cr/x$ does not have a finite mean or variance. This difficulty is (technically) circumvented by adding $1$ to each of the quantities in the estimate: $N \approx (c+1)(r+1)/(x+1).$ Even so, the method works better for larger $x$ (and also for larger $c$ and $r$, if you have the patience).
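As a sanity check, here is a minimal Python simulation of the method as described (the playlist size, seed, and function names are hypothetical; a sketch, not a definitive implementation):

```python
import random

def capture_recapture_estimate(N, c=20, r=20, rng=random):
    """One run of the two-sample method on a playlist of size N:
    listen until c distinct songs are heard, then until another r
    distinct songs are heard, count the overlap x, and return the
    adjusted estimate (c+1)(r+1)/(x+1)."""
    def listen_for_distinct(k):
        heard = set()
        while len(heard) < k:
            heard.add(rng.randrange(N))  # each play is uniform over the playlist
        return heard
    x = len(listen_for_distinct(c) & listen_for_distinct(r))
    return (c + 1) * (r + 1) / (x + 1)

rng = random.Random(7)  # arbitrary seed
estimates = [capture_recapture_estimate(80, rng=rng) for _ in range(10_000)]
print(sum(estimates) / len(estimates))  # rough check: mean should be in the vicinity of 80
```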

Undoubtedly, the method could be improved by continuously monitoring repeats, but such strategies might be too messy to implement in the recreational setting of your question ("without a computer or a calculator").


Comment on simulation results for method based on time to first repeat.

Method: Let $Y$ be the number of songs heard up to and including the first repeat. Then estimate the size $N$ of the playlist as $\hat N = Y(Y-1)/2.$ (Heuristically, $y$ plays contain $\binom{y}{2}$ pairs, each of which is a match with probability $1/N$, so the first repeat tends to arrive around the point where $\binom{Y}{2} \approx N$.) This certainly satisfies OP's request for computational simplicity.

Some simulation results: Based on 100,000 iterations for each of ten values of $N$ from 10 through 200, it seems that this estimator is unbiased, $E(\hat N) = N$, but has a relatively large SD (almost as large as $N$). The distribution of $\hat N$ is strongly right-skewed.
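For anyone who wants to check this, here is a minimal Python sketch of such a simulation (the seed and names are mine, and this is not the code used for the results reported above):

```python
import random
from statistics import mean, stdev

def first_repeat_estimate(N, rng):
    """Play uniformly random songs from a playlist of size N until the
    first repeat; Y counts every song heard, including the repeated one.
    Returns the estimate Y*(Y-1)/2."""
    heard = set()
    y = 0
    while True:
        y += 1
        song = rng.randrange(N)
        if song in heard:
            return y * (y - 1) // 2
        heard.add(song)

rng = random.Random(42)  # arbitrary seed
for N in (10, 50, 200):
    ests = [first_repeat_estimate(N, rng) for _ in range(100_000)]
    print(N, round(mean(ests), 1), round(stdev(ests), 1))
# mean(ests) should come out close to N, with an SD of roughly the same order
```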

I believe I have seen this method before, but cannot immediately recall a rigorous rationale. Perhaps it can be extended to waiting for some small number $k > 1$ of repeats to give an unbiased estimator with a smaller SD.

Comment on limited simulation results for capture-recapture and Langford (inverse coupon collecting) methods.

It is not difficult to simulate the capture-recapture method to see how it performs in a particular instance. Also, the table provided by @vadim123 makes it easy to simulate the Langford method for that one case.

To make the two methods as comparable as possible, suppose the actual length of the playlist is 50.

In the capture-recapture method, suppose we look for matches in two samples of five songs (that is, $c = r = 5$). In 100,000 iterations we got an average estimated playlist size of 28 with an SD of 9.5. The mean plus two standard deviations does not reach the actual value of 50.

In the Langford method we count the unique outcomes among ten songs. In 100,000 iterations we got ten unique songs (and hence no estimate) in about 38,000 of the 100,000 iterations. Among the iterations that produced estimates, the average estimated playlist size was about 34 with SD about 11.5. So this method also seriously underestimates the length of the playlist when it gives an estimate at all, and it gives no useful estimate over a third of the time.
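For anyone who wants to reproduce figures of this kind, here is a rough Python sketch under the same settings ($N = 50$, $c = r = 5$, $m = 10$); it reuses the $m = 10$ table from @vadim123's answer above, and the exact numbers will vary with the seed:

```python
import random
from statistics import mean, stdev

rng = random.Random(2024)  # arbitrary seed
N = 50                     # true playlist length

def capture_recapture(c=5, r=5):
    def distinct(k):
        s = set()
        while len(s) < k:
            s.add(rng.randrange(N))
        return s
    x = len(distinct(c) & distinct(r))
    return (c + 1) * (r + 1) / (x + 1)

# m = 10 lookup table from the Langford-method answer above
TABLE = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 8, 7: 12, 8: 19, 9: 42}

def langford(m=10):
    i = len({rng.randrange(N) for _ in range(m)})  # distinct songs among m plays
    return TABLE.get(i)  # None when i == m (no estimate)

cr = [capture_recapture() for _ in range(100_000)]
lf_all = [langford() for _ in range(100_000)]
lf = [e for e in lf_all if e is not None]
print("capture-recapture:", round(mean(cr), 1), round(stdev(cr), 1))
print("Langford:", round(mean(lf), 1), round(stdev(lf), 1),
      f"(no estimate in {lf_all.count(None)} of 100000 runs)")
```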

It is fair to say that listening to only 10 songs does not give brilliant results with either method. My guess is that the Langford method works best when the number of songs observed is larger.

Addendum after some comments: A simulation with $N = 35$ (and the same sample sizes) gives a mean estimate of about 25 with SD 10 for capture-recapture, and a mean of about 31 with SD 13, with no estimate about a quarter of the time, for the Langford method. The point remains that neither method seems to work very well for small numbers of observed songs. As far as I am concerned, the Question still awaits a simple and more useful answer, if one exists.