Geometric distribution expected value and variance

I am trying to prove that $E[X] = \frac 1 p $ and $Var[X] = \frac{1-p}{p^2}$ where $X$ follows a geometric distribution with probability $p$. I need to prove it recursively, using the fact that $X$ is $1$ with probability $p$ and $1 + Y$ with probability $(1-p)$ (for some $Y \geq 1$), where $Y$ has the same distribution as $X$.

For $E[X]$ I intuitively figured that $E[X] = 1 * p + (1 + E[Y]) * (1 - p)$ and since $E[X] = E[Y]$ the equation simplifies to the desired $E[X] = 1/p$.
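Spelling out that simplification:

$$E[X] = p + (1 + E[X])(1-p) = 1 + (1-p)\,E[X] \;\implies\; p\,E[X] = 1 \;\implies\; E[X] = \frac 1 p.$$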

Here is my first issue: I derived the above equation for $E[X]$ based on intuition alone. My thought process was basically: "using the definition of expectation, $E[X] = 1 * p + (1 + Y) * (1 - p)$, ... but wait, $Y$ is a random variable, so I think I need to take its expectation." Is there a way to justify why I was allowed to just take the expectation there?

Now, to compute the variance, I am also using a similar technique using a recurrence, but I run into some trouble:

$$Var[X] = E[(X - E[X])^2] = E[X^2] - \frac 2 p E[X] + \frac 1 {p^2} = E[X^2] - \frac 1 {p^2}$$

The issue is that I don't really know how to compute $E[X^2]$ using a similar recursive technique. I tried $$E[X^2] = 1^2 * p + (1 + E[X])^2 * (1 - p)$$ but that doesn't look right since $E[X]^2 \neq E[X^2]$ in general, and indeed the math doesn't work out when the equation is simplified (unless I made a mistake).

What is the intuition that I'm missing for this problem?


> Here is my first issue: I derived the above equation for $E[X]$ based on intuition alone. My thought process was basically: "using the definition of expectation, $E[X] = 1 \cdot p + (1 + Y) \cdot (1 - p)$, ... but wait, $Y$ is a random variable, so I think I need to take its expectation." Is there a way to justify why I was allowed to just take the expectation there?

It's called the Law of Iterated Expectation. (This is sometimes known as the Law of Total Expectation.)

If the probability space can be partitioned into a finite collection of disjoint events $\{A_k\}_{k=1}^n$, then:

$$\mathsf E(X) = \sum_{k=1}^n \mathsf E(X\mid A_k)\;\mathsf P(A_k)$$

In this case you are partitioning on the event of a success on the first attempt, and its complement.

$$\begin{align}\mathsf E(X) & = \mathsf E(X\mid X{=}1)\,\mathsf P(X{=}1) + \mathsf E(X\mid X{>}1)\,\mathsf P(X{>}1) \\ & = \mathsf E(X\mid X{=}1)\cdot p + \mathsf E(X\mid X{>}1)\cdot(1-p) \\ & = 1\cdot p + (1+\mathsf E(X))\cdot(1-p) \end{align}$$

The last step uses $\mathsf E(X\mid X{>}1) = \mathsf E(X{+}1) = 1 + \mathsf E(X)$: given that the first attempt fails, the process starts afresh, so the remaining number of attempts has the same distribution as $X$.
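If it helps to see the partition identity numerically, here is a minimal simulation sketch (assuming `numpy` is available; the value $p = 0.3$, the sample size, and the seed are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
p = 0.3                                   # arbitrary success probability for the check
x = rng.geometric(p, size=1_000_000)      # number of trials up to and including the first success

# Partition on {X = 1} (success on the first attempt) and {X > 1} (its complement):
# E(X) = E(X | X = 1) P(X = 1) + E(X | X > 1) P(X > 1)
lhs = x.mean()
rhs = 1.0 * (x == 1).mean() + x[x > 1].mean() * (x > 1).mean()

print(lhs, rhs, 1 / p)                    # all three should be close to each other
print(x[x > 1].mean(), 1 + 1 / p)         # E(X | X > 1) ≈ 1 + E(X), by the restart argument
```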


For the variance, particularly the $\mathsf E(X^2)$ term, you can apply it again.

This time, though, $\mathsf E(X^2\mid X{>}1) = \mathsf E((X{+}1)^2)$.

$$\begin{align} \mathsf E(X^2) & = \mathsf E(X^2\mid X{=}1)\cdot p + \mathsf E(X^2\mid X{>}1)\cdot(1-p) \\ & = p + \mathsf E((X+1)^2)\cdot (1-p) \\ & = p + \mathsf E(X^2+2X+1)\cdot (1-p) \\ & = p + (\mathsf E(X^2) + 2\mathsf E(X) + 1)\cdot(1-p) \end{align}$$
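Solving that last line for $\mathsf E(X^2)$ and substituting $\mathsf E(X) = 1/p$ finishes the job:

$$\mathsf E(X^2) - (1-p)\,\mathsf E(X^2) = p + (1-p)\bigl(2\mathsf E(X) + 1\bigr) \;\implies\; p\,\mathsf E(X^2) = 1 + \frac{2(1-p)}{p} = \frac{2-p}{p} \;\implies\; \mathsf E(X^2) = \frac{2-p}{p^2},$$

so $\mathsf{Var}(X) = \mathsf E(X^2) - \dfrac{1}{p^2} = \dfrac{1-p}{p^2}$, as desired.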


For an infinite sequence of independent coin tosses with $P(H) = p$ in each toss, let $$\begin{align}X &= \text{Number of tosses until H occurs}\\ Y&=\text{Number of tosses after the 1st toss until H occurs}\end{align}$$

Then, using Iverson bracket notation,

$$ X= 1\cdot[\text{H on 1st toss}] + (1+Y)\cdot[\text{T on 1st toss}]$$ so (because $Y$ and $[\text{T on 1st toss}]$ are independent),

$$E(X) = 1\cdot p + (1+E(X))\cdot(1-p)$$ which implies $$E(X) = 1/p.$$

Furthermore, simply squaring the expression for $X$, we have $$\begin{align}E(X^2) &= E\big(\,[\text{H on 1st toss}]^2 + \big((1+Y)\cdot[\text{T on 1st toss}]\big)^2 + 2\cdot (1+Y)\cdot 0\,\big)\\ &= E\big(\,[\text{H on 1st toss}] + (1+Y)^2\cdot[\text{T on 1st toss}]\,\big)\\ &= p + E(1+2Y+Y^2)\,(1-p)\\ &= p + (1 + 2E(X) + E(X^2))\,(1-p) \\ &= p + (1 + 2/p + E(X^2))\,(1-p)\end{align}$$

giving $$E(X^2) = \frac{2-p}{p^2} $$

and finally

$$\mathrm{var} (X) = E(X^2) - (E(X))^2 = \frac{1-p}{p^2}.$$

In the above, we have used the following facts:

  • $[\text{T on 1st toss}]\cdot[\text{H on 1st toss}]= 0$
  • squaring any Iverson bracket does not change its value
  • $Y$ and $[\text{T on 1st toss}]$ are independent

NB: The main idea of this approach is to express the random variable $X$ directly in terms of the random variable $Y$ (which has the same distribution as $X$) together with appropriate "conditional events" (as Iverson brackets). The resulting expression for $X$ is then easily manipulated to compute $E(X), E(X^2)$, etc. as ordinary expectations (without introducing conditional expectations).
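As a sanity check on this decomposition, here is a small simulation sketch (again assuming `numpy`; $p = 0.3$, the sample size, and the seed are arbitrary illustrative choices). It builds $X$ directly from the Iverson-bracket identity and compares the sample moments with $1/p$, $(2-p)/p^2$ and $(1-p)/p^2$:

```python
import numpy as np

rng = np.random.default_rng(seed=1)
p = 0.3                                      # arbitrary illustrative choice
n = 1_000_000

# Y = number of tosses AFTER the first toss until H; same distribution as X,
# and drawn independently of the first toss.
y = rng.geometric(p, size=n)
h_first = rng.random(n) < p                  # Iverson bracket [H on 1st toss]
t_first = ~h_first                           # Iverson bracket [T on 1st toss]

# X = 1*[H on 1st toss] + (1 + Y)*[T on 1st toss]
x = 1 * h_first + (1 + y) * t_first

print(x.mean(),        1 / p)                # E(X)   = 1/p
print((x ** 2).mean(), (2 - p) / p ** 2)     # E(X^2) = (2-p)/p^2
print(x.var(),         (1 - p) / p ** 2)     # var(X) = (1-p)/p^2
```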