Variance of time to find first duplicate

Solution 1:

An alternative approach is to use embedding procedures, as described, for example, in Chapter 15 of Problems and Snapshots from the World of Probability by Blom, Holst, and Sandell.

You can derive the correct asymptotics from the equations:

\begin{eqnarray} E(X) &=& \int_0^\infty \left(1+{x\over n}\right)^n \,e^{-x}\,dx \\[5pt] E(X(X+1))&=& \int_0^\infty 2x\left(1+{x\over n}\right)^n \,e^{-x}\,dx. \end{eqnarray}
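As a quick numerical sanity check (a short Python sketch of my own, not part of the argument; it assumes scipy is available and computes the integrand in log space to avoid overflow), both integrals match a Monte Carlo simulation of the sampling process:

```python
import math
import random
from scipy.integrate import quad

n = 365  # number of equally likely birthdays

def f(x):
    # (1 + x/n)^n * e^{-x}, computed in log space so large x cannot overflow
    return math.exp(n * math.log1p(x / n) - x)

ex_int, _ = quad(f, 0, math.inf)                         # E(X)
exx1_int, _ = quad(lambda x: 2 * x * f(x), 0, math.inf)  # E(X(X+1))

def sample_X():
    """Number of draws until the first repeated birthday."""
    seen = set()
    while True:
        b = random.randrange(n)
        if b in seen:
            return len(seen) + 1
        seen.add(b)

trials = 100_000
xs = [sample_X() for _ in range(trials)]
print(ex_int, sum(xs) / trials)                         # both ~ 24.6 for n = 365
print(exx1_int, sum(x * (x + 1) for x in xs) / trials)  # agree to MC accuracy
```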

Your answer is correct, by the way.


Added: My cursory look at the literature suggests that the standard way to make these approximations rigorous is to write them in integral form, as I suggested, and then apply asymptotic techniques such as Laplace's method or the saddle point method.

For the leading term, though, it suffices to perform the change of variables $y=x/\sqrt{n}$. For $y\geq 0$, $(1+y/\sqrt{n})^n \exp(-y\sqrt{n})$ decreases to $\exp(-y^2/2)$ as $n\to\infty$, since $n\log(1+y/\sqrt{n})-y\sqrt{n}=-y^2/2+y^3/(3\sqrt{n})-\cdots$, so by dominated convergence we get

\begin{eqnarray*}\int_0^\infty \left(1+{x\over n}\right)^n \exp(-x)\,dx &=& \sqrt{n} \int_0^\infty \left(1+{y\over \sqrt{n}}\right)^n \exp(-y\sqrt{n})\,dy\\[5pt] &\approx& \sqrt{n} \int_0^\infty \exp(-y^2/2)\,dy\\[5pt] &=&\sqrt{\pi n/2}, \end{eqnarray*}

and

\begin{eqnarray*}\int_0^\infty 2x \left(1+{x\over n}\right)^n \exp(-x)\,dx &=& 2n \int_0^\infty y \left(1+{y\over \sqrt{n}}\right)^n \exp(-y\sqrt{n})\,dy\\[5pt] &\approx& 2n \int_0^\infty y\exp(-y^2/2)\,dy\\[5pt] &=& 2n.\end{eqnarray*}
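Combining the two asymptotics yields the leading-order variance: since $\mathrm{Var}(X)=\mathbb{E}(X(X+1))-\mathbb{E}(X)-\mathbb{E}(X)^2$, we get $$\mathrm{Var}(X)\approx 2n-\sqrt{\pi n/2}-{\pi n\over 2}\sim\left(2-{\pi\over 2}\right)n.$$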


Another addition:

Checking that it works:

Using the fact that $\int_0^\infty x^\ell e^{-x}\,dx=\ell!$ for all non-negative integers $\ell$, we get \begin{eqnarray*} \int_0^\infty\left(1+{x\over n}\right)^n e^{-x}\,dx &=& \int_0^\infty \sum_{\ell\geq 0} {n\choose \ell} \left({x\over n}\right)^\ell e^{-x}\,dx \\[5pt] &=& \sum_{\ell\geq 0} {n\choose \ell} {\ell!\over n^\ell} \\[5pt] &=& \sum_{\ell\geq 0} {n \over n}\, {n-1\over n} \cdots {n-\ell+1\over n} \\[5pt] &=& \sum_{\ell\geq 0} \mathbb{P}(X>\ell) \\[5pt] &=&\mathbb{E}(X), \end{eqnarray*} where the second-to-last equality holds because ${n\over n}{n-1\over n}\cdots{n-\ell+1\over n}$ is the probability that the first $\ell$ sampled birthdays are all distinct.
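For a direct numerical check of this identity (again a sketch of my own, using the same log-space integrand as above), one can compare the truncated sum with the integral:

```python
import math
from scipy.integrate import quad

n = 365

# E(X) as the sum of P(X > l) = n(n-1)...(n-l+1) / n^l
term, ex_sum = 1.0, 0.0   # term starts at P(X > 0) = 1
for l in range(n + 1):    # terms vanish for l > n
    ex_sum += term
    term *= (n - l) / n   # next factor of the falling product

# the same quantity via the integral representation
ex_int, _ = quad(lambda x: math.exp(n * math.log1p(x / n) - x), 0, math.inf)

print(ex_sum, ex_int)  # both ~ 24.6166 for n = 365
```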

You should also look at Example II.10 (page 114) in Analytic Combinatorics by Philippe Flajolet and Robert Sedgewick, freely available here: http://algo.inria.fr/flajolet/Publications/books.html

How they came up with it:

We embed our sampling problem into continuous time as follows. Let $(N_j(t))_{j=1}^n$ be $n$ independent Poisson processes with rate $1/n$, each one representing a particular birthday. Let $T_j$ be the first time the $j$th process has its second event; at that moment we have observed two people with birthday $j$. As the sum of two independent exponential random variables with rate $1/n$, $T_j$ has a gamma distribution with density $xe^{-x/n}/n^2$, so $$\mathbb{P}(T_j>x)=\left({1+{x\over n}}\right) e^{-x/n}.\tag1$$
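Equivalently, one can read (1) off the Poisson counts directly: $N_j(x)$ is Poisson with mean $x/n$, so $$\mathbb{P}(T_j>x)=\mathbb{P}(N_j(x)\leq 1)=e^{-x/n}+{x\over n}\,e^{-x/n},$$ which is exactly (1).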

We are interested in the expected value of $T:=\min_j T_j$, the time of the first repeated birthday. Since the $T_j$ are independent, (1) gives $$\mathbb{P}(T>x)=\left({1+{x\over n}}\right)^n e^{-x},\tag2$$ and hence $$\mathbb{E}(T)=\int_0^\infty \left({1+{x\over n}}\right)^n e^{-x}\,dx.$$

Now how does this compare with the discrete-time version? Well, we can superpose all these Poisson processes $(N_j(t))_{j=1}^n$ into a single Poisson process with rate $\sum_{j=1}^n 1/n=1$. Looking only at the types of the events, and ignoring the times between them, we recover our discrete-time sampling process. In particular, we get $T=\sum_{i=1}^X Y_i$, where the $Y_i$ are i.i.d. exponential(1) random variables and $X$ is the number of draws needed to get a repeated birthday under discrete-time sampling. The $Y_i$'s and $X$ are independent, so by Wald's identity we have $\mathbb{E}(T)=\mathbb{E}(X)\,\mathbb{E}(Y_1)= \mathbb{E}(X).$
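A short simulation of this superposition (a sketch of my own; the values of `n` and `trials` are illustrative) confirms $\mathbb{E}(T)=\mathbb{E}(X)$:

```python
import random

n, trials = 365, 100_000
tot_T = tot_X = 0.0

for _ in range(trials):
    seen, t = set(), 0.0
    while True:
        t += random.expovariate(1.0)  # inter-event times of the rate-1 process
        b = random.randrange(n)       # each event carries a uniform birthday
        if b in seen:
            break
        seen.add(b)
    tot_T += t               # T: continuous time of the first repeat
    tot_X += len(seen) + 1   # X: number of draws at the first repeat

print(tot_T / trials, tot_X / trials)  # both ~ 24.6 for n = 365
```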

An analysis of the higher moments gives $\mathbb{E}(T^j)= \mathbb{E}(X(X+1)\cdots (X+j-1))$ for all integers $j\geq 0$.
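This follows by conditioning on $X$: given $X=k$, $T$ is the sum of $k$ i.i.d. exponential(1) variables, hence gamma distributed, and $$\mathbb{E}(T^j\mid X=k)=\int_0^\infty x^j\,{x^{k-1}e^{-x}\over(k-1)!}\,dx={(k+j-1)!\over(k-1)!}=k(k+1)\cdots(k+j-1).$$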