A fair die is rolled 10 times. Define N to be the number of distinct outcomes. Find the mean and standard deviation of N.

A fair die is rolled 10 times. Define N to be the number of distinct outcomes. For example, if we roll (3, 2, 2, 1, 4, 3, 1, 6, 2, 3) then N = 5 (the distinct outcomes being 1, 2, 3, 4 and 6). Find the mean and standard deviation of N. Hint: Define $I_j = \left\{ \begin{array}{ll} 1 & \text{if outcome j appears at least once} \\ 0 & \text{otherwise} \end{array} \right.$ and $j=1,...,6$. Then $N=\sum_{j=1}^{6} I_j$ where $I_1,...I_6$ are dependent.

Any help would be much appreciated!


Supposing that the fair die has $n$ sides and is rolled $m$ times we get for the probability of $k$ distinct outcomes by first principles the closed form

$$\frac{1}{n^m} \times m! [z^m] {n\choose k} (\exp(z)-1)^k.$$

Let us verify that this is a probability distribution. We obtain

$$\frac{1}{n^m} \times m! [z^m] \sum_{k=0}^n {n\choose k} (\exp(z)-1)^k.$$

Here we have included the value for $k=0$ because it is a constant with respect to $z$ and hence does not contribute to $[z^m]$ where $m\ge 1.$ Continuing we find

$$\frac{1}{n^m} \times m! [z^m] \exp(nz) = 1$$

and the sanity check goes through. Now for the expectation we get

$$\mathrm{E}[N] = \frac{1}{n^m} \times m! [z^m] \sum_{k=1}^n k {n\choose k} (\exp(z)-1)^k \\ = \frac{n}{n^m} \times m! [z^m] \sum_{k=1}^n {n-1\choose k-1} (\exp(z)-1)^k \\ = \frac{n}{n^m} \times m! [z^m] (\exp(z)-1) \sum_{k=0}^{n-1} {n-1\choose k} (\exp(z)-1)^k \\ = \frac{n}{n^m} \times m! [z^m] (\exp(z)-1) \exp((n-1)z).$$

This is

$$\frac{n}{n^m} \times (n^m - (n-1)^m) = n \left(1-\left(1-\frac{1}{n}\right)^m\right).$$

This goes to $n$ in $m$ as it ought to. For the next factorial moment we obtain

$$\mathrm{E}[N(N-1)] = \frac{1}{n^m} \times m! [z^m] \sum_{k=2}^n k (k-1) {n\choose k} (\exp(z)-1)^k \\ = \frac{n(n-1)}{n^m} \times m! [z^m] \sum_{k=2}^n {n-2\choose k-2} (\exp(z)-1)^k \\ = \frac{n(n-1)}{n^m} \times m! [z^m] (\exp(z)-1)^2 \sum_{k=0}^{n-2} {n-2\choose k} (\exp(z)-1)^k \\ = \frac{n(n-1)}{n^m} \times m! [z^m] (\exp(z)-1)^2 \exp((n-2)z).$$

This is

$$\frac{n(n-1)}{n^m} (n^m - 2(n-1)^m + (n-2)^m) \\ = n(n-1) \left(1-2\left(1-\frac{1}{n}\right)^m + \left(1-\frac{2}{n}\right)^m \right).$$

We get for the variance

$$\mathrm{Var}(N) = \mathrm{E}[N^2] - \mathrm{E}[N]^2 = \mathrm{E}[N(N-1)] + \mathrm{E}[N] - \mathrm{E}[N]^2$$

which yields

$$n^2 \left(1-2\left(1-\frac{1}{n}\right)^m + \left(1-\frac{2}{n}\right)^m \right) \\ - n \left(1-2\left(1-\frac{1}{n}\right)^m + \left(1-\frac{2}{n}\right)^m \right) + n \left(1-\left(1-\frac{1}{n}\right)^m\right) \\ - n^2 \left(1-2\left(1-\frac{1}{n}\right)^m + \left(1-\frac{1}{n}\right)^{2m} \right) \\ = n^2 \left(\left(1-\frac{2}{n}\right)^m - \left(1-\frac{1}{n}\right)^{2m} \right) + n \left(\left(1-\frac{1}{n}\right)^m - \left(1-\frac{2}{n}\right)^m \right).$$

The absolute value of the two differences of terms that are geometric in $m$ with positive common ratio less than one is bounded by the maximum of these two, which vanishes in $m$ and hence the variance goes to zero for $n$ fixed and $m$ going to infinity.

The queried standard deviation is then given by the root of the variance.

Here is some Maple code to explore these numbers.

ENUM :=
proc(n, m)
    option remember;
    local ind, data, gf;

    gf := 0;

    for ind from n^m to 2*n^m - 1 do
        data := convert(ind, base, n)[1..m];
        gf := gf + v^nops(convert(data, `multiset`));
    od;

    gf/n^m;
end;

EN_PROB := (n, m) -> subs(v=1, ENUM(n, m));

EN_FM1 := (n, m) -> subs(v=1, diff(ENUM(n, m), v));
EN_FM2 := (n, m) -> subs(v=1, diff(ENUM(n, m), v$2));

FM1 := (n, m) -> n*(1-(1-1/n)^m);
FM2 := (n, m) -> n*(n-1)*(1 - 2*(1-1/n)^m + (1-2/n)^m);

EN_VAR := (n, m) ->  - EN_FM1(n, m)^2
+ subs(v=1, diff(v*diff(ENUM(n, m), v), v));

VAR := (n, m) ->
n^2*((1-2/n)^m-(1-1/n)^(2*m)) + n*((1-1/n)^m-(1-2/n)^m);

Addendum. As pointed out by @JMoravitz we may need some clarification of the reasoning by which the probabilities are obtained. Note that $k$ distinct outcomes first of all require a choice of these $k$ values from the $n$ possibilities, for a factor of ${n\choose k}.$ Now we need to match up each of these $k$ values (think of them as listed sequentially from smallest to largest) with a subset of the $m$ rolls which may not be empty (this is what prevents double counting here). This yields the labeled combinatorial class (labeled means EGFs)

$$\def\textsc#1{\dosc#1\csod} \def\dosc#1#2\csod{{\rm #1{\small #2}}} \textsc{SEQ}_{=k}(\textsc{SET}_{\ge 1}(\mathcal{Z}))$$

or alternatively (compare to Stirling numbers of the second kind)

$$\textsc{SEQ}_{=k}(\textsc{SET}_{=1}(\mathcal{Z}) + \textsc{SET}_{=2}(\mathcal{Z}) + \textsc{SET}_{=3}(\mathcal{Z}) + \cdots).$$

We get the generating function

$$\left(z+\frac{z^2}{2!}+\frac{z^3}{3!}+\frac{z^4}{4!}+\cdots\right)^k = (\exp(z)-1)^k$$

and may then continue as before.

Addendum II. Another combinatorial class we can use here is

$$\textsc{SEQ}_{=n}(\mathcal{E} + \mathcal{U}\times\textsc{SET}_{\ge 1}(\mathcal{Z})).$$

This gives the EGF

$$G(z, u) = (1+u(\exp(z)-1))^n = \sum_{k=0}^n {n\choose k} u^k (\exp(z)-1)^k.$$

On evaluating

$$\left. \frac{\partial}{\partial u} G(z, u) \right|_{u=1}$$

using the second form we get the same sum as before. We can also give an alternate evaluation

$$\frac{1}{n^m} \times m! [z^m] \left. \frac{\partial}{\partial u} G(z, u) \right|_{u=1} = \frac{n}{n^m} \times m! [z^m] \exp((n-1)z) (\exp(z) - 1) \\ = \frac{n}{n^m} \times m! [z^m] (\exp(nz) - \exp((n-1)z)) = \frac{n}{n^m} (n^m - (n-1)^m) \\ = n \left(1-\left(1-\frac{1}{n}\right)^m\right).$$

For the next factorial moment we need another derivative which is

$$n(n-1)(1+u(\exp(z)-1))^{n-2} (\exp(z)-1)^2.$$

Set $u=1$ to get

$$n(n-1) \exp((n-2)z)(\exp(2z)-2\exp(z)+1).$$

Extracting coefficients now yields

$$\frac{n(n-1)}{n^m} \left(n^m - 2 (n-1)^m + (n-2)^m\right)$$

and the rest is as in the first version. Note that we get a faster enumeration routine if we admit the classification from the introduction. The result is midway between enumeration and the closed form and is shown below.

with(combinat);

ENUM2 :=
proc(n, m)
    option remember;
    local gf, part, psize, mset;

    gf := 0;

    part := firstpart(m);

    while type(part, `list`) do
        psize := nops(part);
        mset := convert(part, `multiset`);

        gf := gf + binomial(n, psize)
        *m!/mul(p!, p in part)
        *psize!/mul(p[2]!, p in mset)*v^psize;

        part := nextpart(part);
    od;

    gf/n^m;
end;

For the sake of brevity, I will assume that you already know the definitions of random variables, expected value, and standard deviation and will not formally define these terms. For a formal definition, seek out any textbook on the subject.

Our problem asks us to find the mean (expected value) and standard deviation (square root of variance) for the random variable $N$ which represents the total number of unique results from a throw of ten (presumably standard fair six-sided) dice. Clearly $N$ is an integral random variable which takes the values $0,1,2,\dots,6$ with corresponding probabilities $p_0,p_1,p_2,\dots,p_6$ respectively.

One (naive) approach to the problem would be to try to calculate each of $p_0,p_1,\dots$ and use the formula $E[N] = 0\cdot p_0+1\cdot p_1+2\cdot p_2+\dots+6\cdot p_6$, call this result $\mu$, and then continue to calculate $Var(N) = (0-\mu)^2\cdot p_0+(1-\mu)^2\cdot p_1+\dots+(6-\mu)^2\cdot p_6$. Although this would work, this is a horribly difficult and tedious approach and is not at all recommended. We don't actually care about finding the exact values of each of these probabilities if all we want to know about is the expected value and variance. Instead, we approach in a much smarter and more efficient way which is alluded to in the given hint.

By letting $I_j$ represent the Indicator Random Variable which takes the value of $1$ when result $j$ was included in the results of the ten thrown dice, and takes the value of $0$ otherwise. We notice then that $N$ could be represented by adding together $I_1,I_2,\dots,I_6$. We choose to do this because calculations using indicator random variables is much easier than otherwise since they only ever take the values of zero or one.

I am happy to see from the comments that you have realized the strength of this for finding the expected value of $N$. Indeed, by the linearity of expectation and the symmetry of the problem we have $E[N] = E[\sum\limits_{j=1}^6 I_j] = \sum\limits_{j=1}^6 E[I_j] = 6\cdot E[I_1] = 6\cdot Pr(I_1=1) = 6\cdot (1-(\frac{5}{6})^{10})$


Continuing the problem to find the Variance (and taking the square root of this result to find the standard deviation), we remember that from basic properties one finds that $Var(N)=E[N^2]-(E[N])^2$

We know how to deal with the second term from our earlier work in the previous part, but we now need to look at $N^2$ and see how it acts. Continuing to use the indicator random variables, we see that

$N^2 = (I_1+I_2+\dots+I_6)^2=(I_1+I_2+\dots+I_6)\cdot(I_1+I_2+\dots+I_6)$

$=I_1(I_1+I_2+\dots+I_6)+I_2(I_1+I_2+\dots+I_6)+\dots+I_6(I_1+I_2+\dots+I_6)$

As alluded to in my initial comments above, by expanding this completely and not adding any like-terms together, we are left with $36$ terms total in the above expression. By splitting these up into two categories, those which involve only one subscript and those which involve two subscripts, we have $6$ terms of the form $I_j^2$ and $30$ terms of the form $I_jI_k$ with $j\neq k$. (I specify without adding like-terms together to make calculations more transparent since you have $I_1I_2$ as well as $I_2I_1$ appearing once each)

We have as a result $E[N^2]=E[(\sum\limits_{j=1}^6 I_j)^2] = E[\sum\limits_{j=1}^6\sum\limits_{k=1}^6 I_jI_k] = E[\sum\limits_{j=1}^6 I_j^2 + \sum\limits_{(j,k)\in \{(x,y)\in [6]^2~:~x\neq y\}} I_jI_k]$

$=\sum\limits_{k=1}^6 E[I_1^2] + \sum\limits_{(j,k)\in \{(x,y)\in [6]^2~:~x\neq y\}} E[I_1I_2] = 6\cdot E[I_1^2]+30\cdot E[I_1I_2]$

Now, notice that $I_1^2 = I_1$. Why? Well, whenever $I_1$ takes the value of $1$, $I_1^2$ will also take the value of $1$ since $1^2=1$ and whenever $I_1$ takes the value of $0$, $I_1^2$ will also take the value of $0$ since $0^2=0$. Incredibly convenient! So, we have already done the necessary calculations to find this part.

Now... we look a bit more closely at $I_1\cdot I_2$. Since $I_1$ and $I_2$ are indicator random variables, you have $I_1\cdot I_2$ will take the value of $1$ if and only if both of $I_1$ and $I_2$ each took the value of $1$ as well and take the value of zero otherwise. That is to say $I_1\cdot I_2$ is also an indicator random variable! Specifically:

$I_1\cdot I_2 = \begin{cases} 1&\text{if}~1~\text{and}~2~\text{both appeared in the results of the ten thrown dice}\\0&\text{otherwise}\end{cases}$

So, $E[I_1\cdot I_2] = Pr(I_1\cdot I_2 = 1) = Pr(\text{both}~1~\text{and}~2~\text{appear})$

I will leave the rest to you to complete now, but all that remains should be a routine calculation perhaps involving the use of inclusion-exclusion and arithmetic putting all of the gathered information together into a final answer.