At time n, randomly choose a natural number ≤n. How long is it until a single number is chosen three times?

To clarify, the number ≤n is chosen uniformly at random at each step, and n chooses from the natural numbers beginning with 1.

I wish to determine the expected value of $n$ at which a natural number is chosen three times (for the first time). (I would also ideally like to know how to calculate E(a number being chosen $y$ times))

I calculated $\Pr(\text{hitting a number thrice at } n=x)$ for some low values, but it rapidly becomes a lot to do by hand.

\begin{align*} \Pr(n=1) &= \Pr(n=2) = 0 \\ \Pr(n=3) &= 1/6 \\ \Pr(n=4) &= 1/6 \\ \Pr(n=5) &= 19/120 \end{align*}

The inspiration for this question refers to the card game Hearthstone and a particular card interaction. See http://hearthstone.gamepedia.com/Grim_Patron http://hearthstone.gamepedia.com/Bouncing_Blade

The title of this question and the phrasing used throughout is not how I conceived the question, but a reformulation that mjqxxxx posted.


It turns out the numerical findings about $\mathbb{E}[N_3]$ by David E is not a coincidence, it is exact! $$\mathbb{E}[N_3] = \frac{1}{1-\sin(1)}$$

Let $X_1, X_2, \ldots $ be a sequence of random variables. For each $n$, we will assume $X_n$ take value from the set $\langle n \rangle \stackrel{def}{=}\{ 1, 2, \ldots, n \}$ with uniform probability.

After the $n^{th}$ iteration, consider following random variables:

  • $Y_{m} = \# \{ i \in \langle n \rangle : X_i = m \}$ be the number of times the number $m$ is chosen.
  • $Z_{m} = \# \{ i \in \langle n \rangle : Y_i = m \}$ be the number of numbers which has been chosen $m$ times.

For those configuration where none of the number has been chosen more than twice, $Z_{k}$ are not independent from each other. In fact, if we know $Z_0 = p$, we will have

$$Z_1 = n - 2p,\quad Z_2 = p\quad\text{ and }\quad Z_k = 0, \forall k > 2$$

For such a configuration, it is clear at the $(n+1)^{th}$ iteration, the probability that

  • $Z_0$ remains unchanged is $\frac{p+1}{n+1}$.
  • $Z_0$ increases by $1$ is $\frac{n-2p}{n+1}$.
  • some number get picked $3$ times is $\frac{p}{n+1}$.

If we define

  • $f_{n,p} = \mathbb{P}\left[ Z_0 = p, Z_1 = 2n-p, Z_2 = p, Z_k > 0 \land \forall k > 2 \right]$
  • $f_n(z) = \sum_{p=0}^n f_{n,p} z^p$,
  • $F(z,t) = 1 + \sum_{n=1}^\infty f_n(z) t^n$

We find $f_{n,p}$ satisfy following recurrence relation:

$$ f_{n+1,p} = \frac{p+1}{n+1}f_{n,p} + \frac{n-2(p-1)}{n+1}\begin{cases} f_{n,p-1}, & p > 0\\0, &p = 0\end{cases}\\ \implies (n+1) f_{n+1}(z) = \left\{ \left(1 + z\frac{\partial}{\partial z}\right) + z\left( n - 2z\frac{\partial}{\partial z}\right) \right\} f_n(z) $$ Multiply each term by $t^n$ and sum and notice $f_1(z) = 1$, we find:

$$\begin{align} & \frac{\partial}{\partial t} ( F(t,z) - 1 - t ) = \left\{ \left(1 + z\frac{\partial}{\partial z}\right) + z\left( t\frac{\partial}{\partial t} - 2z\frac{\partial}{\partial z}\right) \right\} ( F(z,t) - 1)\\ \iff & \left\{ (1 - zt)\frac{\partial}{\partial t} - (1 + z(1-2z)\frac{\partial}{\partial t}\right\} F(z,t) = 0\\ \iff & \left\{ (1 - zt)\frac{\partial}{\partial t} - z(1-2z)\frac{\partial}{\partial t}\right\} \left( \frac{z}{1-2z} F(z,t) \right) = 0 \end{align} $$ Using method of characteristics, one can show that the general solution of last PDE has the form

$$F(z,t) = \frac{1-2z}{z} \varphi\left(\frac{1-\sqrt{1-2z}}{1+\sqrt{1-2z}}e^{t\sqrt{1-2z}}\right)$$ where $\varphi(\cdots)$ is an arbitrary function. We can determine $\varphi(\cdots)$ by setting to $t = 0$, we have

$$\frac{1-2z}{z} \varphi\left(\frac{1-\sqrt{1-2z}}{1+\sqrt{1-2z}}\right) = F(z,0) = 1$$ After a little bit of algebra, we get $$F(z,t) = \frac{4u^2 e^{tu}}{((1+u) - (1-u) e^{tu})^2} \quad\text{ where }\quad u = \sqrt{1-2z} $$

Notice for each $n$, $f_n(1) = \sum_{p=0}^n f_{n,p}$ is the probability that none of the numbers has been chosen more than twice. This means $f_{n-1}(1) - f_n(1)$ is the probability that some "new" number get picked the third times. So the number we want is

$$\mathbb{E}[N_3] = (1-f_1(1)) + \sum_{n=2}^\infty n (f_{n-1}(1) - f_{n}(1)) = 1 + \sum_{n=1}^\infty f_n(1) = F(1,1)$$

Substitute this back to our expression of $F(z,t)$, we get

$$\begin{align} \mathbb{E}[N_3] & = \frac{4i^2 e^i}{((1+i) - (1-i) e^i)^2} = \frac{1}{1-\sin(1)}\\ & \approx 6.307993516443740027513521739824160128971342... \end{align} $$ There are other interesting statistics one can gather from $F(z,t)$. For example, the generating function for the "survival" probability $f_n(1)$ is

$$P_{survival}(t) \stackrel{def}{=} 1 + \sum_{n=0}f_n(1)t^n = F(1,t) = \frac{1}{1-\sin(t)}$$

and that for the probability for "death at $3^{rd}$ strike" at step $n$ is

$$\begin{align} P_{3^{rd} strike}(t) & \stackrel{def}{=} (1-f_1(1)) t + \sum_{n=2}^\infty (f_{n-1}(1) - f_{n}(1)) t^n\\ & = (t - 1) P_{survival}(t) + 1 = \frac{t-1}{1-\sin(t)} + 1 \end{align}$$

If one throw the last expression to a CAS and ask it to compute the Taylor expansion, one get

$$\begin{align} P_{3^{rd} strike}(t) = & \frac{1}{6}t^3 + \frac{1}{6}t^4 + \frac{19}{120}t^5 + \frac{47}{360}t^6 +\frac{173}{1680}t^7 + \frac{131}{1680} t^8 +\frac{20903}{362880}t^9\\ & +\frac{75709}{1814400}t^{10} + \frac{1188947}{39916800}t^{11} +\frac{2516231}{119750400}t^{12} + \cdots \end{align}$$ The coefficient for $t^k$ matches the numbers $Pr[N_3 = k]$ listed in another answer.

Update 1

Let's become Don Quixote and challenge the harder problem of computing $\mathbb{E}[N_y]$ for $y > 3$.

Similar to $y = 3$, one can setup a PDE for the generating functions. However, if one only want $\mathbb{E}[N_y]$, there is no need to solve the PDE completely. One can use the method of characteristics and express $\mathbb{E}[N_y]$ in terms of solutions of some ODE.

The derivation is a mess. I'll spare you all boring details and here is the recipe:

  1. The case for $y = 4$ is straight forward, one can show that $$\mathbb{E}[N_4] = \frac{\rho^3 + 3\rho + 2}{6} \;\;\text{ where }\;\; \rho \;\;\text{ satisfies }\;\; 1 = \int_1^\rho \frac{6ds}{s^3 + 3s + 2} $$

  2. For general $y$, setup a system of ODE in $y+1$ varibles $\psi, t, z_1, z_2, \ldots, z_{y-1}$: $$ \frac{d\psi}{d\tau} = z_1 \psi,\quad \frac{dt}{d\tau} = t z_1-1\quad \quad\text{ and }\quad \frac{dz_k}{d\tau} = \begin{cases} -z_k(z_k - z_{k+1}), & k < y-1\\ -z_k^2, & k = y-1 \end{cases} $$ If one start integrate this system of ODE using initial values $$( \psi, t, z_1, \ldots, z_{y-1} ) = (1, 1, \ldots, 1 )\quad\text{ at }\quad \tau = 0$$ to the point $\rho$ where $t(\rho) = 0$, then $\psi(\rho)$ will be equal to the number $\mathbb{E}[N_y]$ we seek.

Following are some numerical results. The label $\verb/R1/$ and $\verb/R2/$ indicate which recipe has been used to compute the result. The number behind the label $\verb/R2/$ is the maximum time-step used in numerically integration.

$$\begin{array}{c:l:l} y & \mathbb{E}[N_y] & \verb/methodology/\\ \hline 2 & 6.30799351644374 & \verb/exact/ = \frac{1}{1-\sin 1}\\ 2 & 6.3079930650 & \verb/R2 /(10^{-4})\\ 2 & 6.3079935164 & \verb/R2 /(10^{-6})\\ \hline 3 & 13.77250982352477 & \verb/R1/ \\ 3 & 13.7725078084 & \verb/R2 /(10^{-4})\\ 3 & 13.7725098234 & \verb/R2 /(10^{-6})\\ \hline 4 & 29.1475420469 & \verb/R2 /(10^{-4})\\ 4 & 29.1475467696 & \verb/R2 /(10^{-6})\\ \hline 5 & 60.5714357748 & \verb/R2 /(10^{-4})\\ 5 & 60.5714459868 & \verb/R2 /(10^{-6})\\ \hline 6 & 124.4243032167 & \verb/R2 /(10^{-4})\\ 6 & 124.4243208364 & \verb/R2 /(10^{-6}) \end{array} $$

As one can see, $\mathbb{E}[N_y]$ seems to approximately double for each increment of $y$.

Update 2

It turns out part of the complicated ODE in Update 1 can be solved explicitly.
For any given $y \ge 3$ and $k = 1,2,\ldots,y-1$, we have

$$z_{k}(s) = \frac{e_{y-k-1}(s)}{e_{y-k}(s)} \quad\text{ where }\quad e_m(x) = \sum_{k=0}^{m} \frac{x^k}{k!}$$ The new simplified recipe is

$$\mathbb{E}[N_y] = e_{y-1}(\rho) \quad\text{ where }\quad \rho \quad\text{ is root of the equation }\quad 1 = \int_0^\rho \frac{ds}{e_{y-1}(s)} $$ Following is some numerical results computed using this new recipe. All numbers has been truncated to 6-decimals places for easy comparison. As one can see from the $3^{rd}$ column, $2^y$ is a reasonable decent $1^{st}$ order approximation for $\mathbb{E}[N_y]$ as $y$ grows.

$$\begin{array}{r:r:l} y & \mathbb{E}[N_y] & \mathbb{E}[N_y]/2^y\\ \hline 3 & 6.307993 & 0.788499\\ 4 & 13.772509 & 0.860781\\ 5 & 29.147546 & 0.910860\\ 6 & 60.571445 & 0.946428\\ 7 & 124.424320 & 0.972065\\ 8 & 253.615739 & 0.990686\\ 9 & 514.170899 & 1.004240\\ 10& 1038.407593 & 1.014069\\ 11& 2091.272044 & 1.021128\\ 12& 4202.932580 & 1.026106\\ 13& 8433.748402 & 1.029510\\ 14& 16903.678242 & 1.031718\\ 15& 33849.909944 & 1.033017 \end{array} $$


Reprenting the health of each Patron by $y$, and the quantity of interest by $N_y$, I get: \begin{align*} \Pr[ N_3 = 1 ] &= 0 \\ \Pr[ N_3 = 2 ] &= 0 \\ \Pr[ N_3 = 3 ] &= 1/6 \\ \Pr[ N_3 = 4 ] &= 1/6 \\ \Pr[ N_3 = 5 ] &= 19/120 \\ \Pr[ N_3 = 6 ] &= 47/360 \\ \Pr[ N_3 = 7 ] &= 173/1680 \\ \Pr[ N_3 = 8 ] &= 131/1680 \\ \Pr[ N_3 = 9 ] &= 20903/362880 \\ \Pr[ N_3 = 10 ] &= 75709/1814400 \\ \Pr[ N_3 = 11 ] &= 1188947/39916800 \\ \Pr[ N_3 = 12 ] &= 2516231/119750400 \\ \Pr[ N_3 = 13 ] &= 3386161/230630400 \\ \Pr[ N_3 = 14 ] &= 147882737/14529715200 \\ \Pr[ N_3 = 15 ] &= 1832969507/261534873600 \\ \Pr[ N_3 = 16 ] &= 570448019/118879488000 \\ \Pr[ N_3 = 17 ] &= 1162831155151/355687428096000 \\ \Pr[ N_3 = 18 ] &= 1014210646079/457312407552000 \\ \Pr[ N_3 = 19 ] &= 4674810997597/3119105138688000 \\ \end{align*}

$\mathbb{E}[N_3] = 6.30799351644$ (calculated with $x < 100$)

This is calculated recursively by saving the number of Patrons with 2,1,0 hits respectively in a tuple (a,b,c), and calculating the possibilities recursively.

For reference, we have:

\begin{align*} \mathbb{E}[N_2] &= e \qquad \text{(easyish exercise)} \\ \mathbb{E}[N_3] &= 6.30799351644 \dots \\ \mathbb{E}[N_4] &= 13.7725 \dots \\ \mathbb{E}[N_5] &\approx 30 \end{align*}

Getting an exact answer for $y \geq 5$ is beyond my computational power, but I expect $\mathbb{E}[N_5] \approx 30$. Convergence really slows as $y$ grows. (As do the number of cases, and the size of numerators / denominators of fractions for exact calculations).

IMHO the chances of there existing a "nice" closed form for $y > 2$ are slim.

Though, interestingly, putting $\mathbb{E}[N_3]$ into the inverse symbolic calculator https://isc.carma.newcastle.edu.au/advancedCalc gives $\mathbb{E}[N_3] \approx 1/(1-\sin(1))$ to insanely high accuracy... but I imagine this is almost certainly a coincidence.

EDIT: Turns out I am completely wrong about this - this is exact - see achille hui's answer: https://math.stackexchange.com/a/1243379/200767

Rough bound on $\mathbb{E}[N_y]$

We can loosely bound $\mathbb{E}[N_y]$ though, as observe that the expected number of times the first patron is hit after $n$ total hits is

\begin{align*} \sum_{i=1}^{n} \frac{1}{i} \sim \ln n \to \infty \end{align*}

So, in particular, very loosely, if $\ln n \geq y$, we expect the first Patron will have been killed. Ie $\mathbb{E}[N_y] \leq O(e^y)$.

Code (Python 2.7):

from fractions import Fraction
N = 100;
y = 5;

probs = [{}] * (N+1);
probs[0] = {tuple([0] * y): Fraction(1, 1)};
death_prob = [Fraction(0,1)] * (N+1);

for n in range(1, N+1): # n-1 hits so far. Now calculating nth hit
    probs[n] = {};
    for prev_hit_counts, prob in probs[n-1].iteritems():
        prob_scaler = prob / n;

        for i in range(y): # A Patron who has already been hit i times gets hit
            if prev_hit_counts[i] == 0 and i != 0: # Check such a Patron exists
                continue

            new_hit_counts = list(prev_hit_counts);
            new_hit_counts[0] += 1; # Add a new Patron who has not been hit
            probability = new_hit_counts[i] * prob_scaler; # Probability of hitting such a Patron

            if i == y-1:
                # A Patron who has been hit y-1 times gets hit... we lose.
                death_prob[n] += probability;
            else:
                new_hit_counts[i] -= 1;
                new_hit_counts[i+1] += 1;
                if tuple(new_hit_counts) not in probs[n]:
                    probs[n][tuple(new_hit_counts)] = Fraction(0,1);
                probs[n][tuple(new_hit_counts)] += probability;


for n in range(1,N+1):
    if n <= 20:
        print '\\Pr[N_{0} = {1}] &= {2} = {3} \\\\'.format(y,n,death_prob[n],float(death_prob[n]));
    else:
        print '\\Pr[N_{0} = {1}] &= {2} \\\\'.format(y,n,float(death_prob[n]));

print '\\mathbb{{E}}[N_{0}] = {1}'.format(y,float(sum([n * i for n,i in enumerate(death_prob)])));
print 'Using 0 <= x <= {0}'.format(N);