Intuition behind a hypergeometric distribution with standard deviation greater than the mean

Say I've bought 100 tickets in a raffle of 10,000 tickets total.

Assuming three tickets are randomly drawn without replacement from this raffle, we can represent the probability of different outcomes as a hypergeometric distribution with parameters $n = 3$, $s = 100$ and $N = 10,000$.

In this case, the number of my tickets that I can expect to be drawn is:

$$E(X) = n\left(\frac{s}{N}\right) = 0.03$$

with a standard deviation of:

$$\sigma(X) = \sqrt{n \left(\frac{s}{N}\right) \left(1 - \frac{s}{N}\right) \left(\frac{N - n}{N - 1}\right)} = 0.17$$

I'm trying to intuitively understand how this makes sense - our expected number of successes in this instance is $0.03 \pm 0.17$. This means that negative successes lie in the range of probable outcomes, but by definition we can't have negative successes.

Hence, are the expected outcomes positively skewed in some way? For example, if each prize were worth €20, we'd have an expected return of $€0.60 \pm €3.40$. But since we can't have negative prizes, this actually lies in the range €0 to €4.00.

This doesn't seem right but I'm trying to figure out where I'm going wrong with my assumptions and interpretations above.

Many thanks.


Solution 1:

This hypergeometric distribution takes values $k = 0,1,2,3.$ Almost all of the probability is at $0.$ [Computations in R.]

k = 0:3
pdf = dhyper(k, 100, 10000-100, 3)
cbind(k, pdf)
     k          pdf
[1,] 0 9.702961e-01
[2,] 1 2.940885e-02
[3,] 2 2.941182e-04
[4,] 3 9.704911e-07
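
For reference, the probabilities in this table follow from the hypergeometric mass function, written here with the question's symbols ($n$ draws, $s$ tickets held, $N$ tickets in total). The case $k = 0$ is worked out as a check:

$$P(X = k) = \frac{\binom{s}{k}\binom{N-s}{n-k}}{\binom{N}{n}}, \qquad P(X = 0) = \frac{9900 \cdot 9899 \cdot 9898}{10000 \cdot 9999 \cdot 9998} \approx 0.9703.$$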

Your computations for the mean and SD are correct:

sum(pdf)
[1] 1
mu = sum(k*pdf); mu
[1] 0.03
sqrt(sum((k-mu)^2 * pdf))
[1] 0.1723196
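
As a cross-check, the closed-form expressions from the question can be evaluated directly and compared with the sums over the probability table. This is only a sketch; the variable names (`mean_formula`, `sd_table`, and so on) are illustrative and not part of the original computation.

```r
# Parameters, named as in the question
n <- 3       # tickets drawn
s <- 100     # tickets held
N <- 10000   # tickets in the raffle

# Closed-form mean and SD of the hypergeometric distribution
p <- s / N
mean_formula <- n * p
sd_formula <- sqrt(n * p * (1 - p) * (N - n) / (N - 1))

# The same quantities computed from the probability table
k <- 0:n
pdf <- dhyper(k, s, N - s, n)
mean_table <- sum(k * pdf)
sd_table <- sqrt(sum((k - mean_table)^2 * pdf))

# Compare the two routes up to floating-point tolerance
all.equal(c(mean_formula, sd_formula), c(mean_table, sd_table))
```

Both routes should give the same values up to floating-point error.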

If we simulate a million such lotteries, outcomes are as follows:

set.seed(2022)
x = rhyper(10^6, 100, 10000-100, 3)
mean(x)
[1] 0.029836
sd(x)
[1] 0.1718134
table(x)/10^6
x
        0        1        2        3 
 0.970450 0.029265 0.000284 0.000001 

The simulated mean and SD match your population values to two or three decimal places, which is about the accuracy to expect from a million iterations.
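
To put a rough number on that accuracy, one can compute the standard error of a mean of a million independent draws and see how far the simulated mean reported above (0.029836) lies from the exact mean. This is a sketch; the names `m` and `se_mean` are illustrative.

```r
# Exact mean and SD from the probability table
k <- 0:3
pdf <- dhyper(k, 100, 10000 - 100, 3)
mu <- sum(k * pdf)
sigma <- sqrt(sum((k - mu)^2 * pdf))

m <- 10^6                    # number of simulated lotteries
se_mean <- sigma / sqrt(m)   # standard error of the simulated mean
se_mean

# Distance of the simulated mean from the exact mean, in standard-error units
(0.029836 - mu) / se_mean
```

A distance of roughly one standard error is what ordinary sampling variation would produce.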

hist(x, probability=TRUE, breaks=seq(-.5,3.5), col="skyblue2")
points(k, pdf, pch=19, col="red")

[Figure: histogram of the simulated outcomes (blue bars) with the exact probabilities overlaid as red points]

The vertical resolution of this graph is about $0.02,$ so the bars for $P(X = 2), P(X = 3)$ are too short to show.

An unusual feature of this hypergeometric distribution may be the very small number of prizes, and hence the very low probability of winning. As seems logical, if you hold 100 of the 10,000 tickets and only three are drawn, you have only about a 3% chance of winning exactly one prize and almost no chance of winning more than one.
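
One rough way to see why the SD exceeds the mean here is to treat $X$ as approximately a 0/1 variable with success probability $p \approx 0.03$. This is an approximation that ignores the tiny probabilities of two or three wins, not the exact computation:

$$\mu \approx p, \qquad \sigma \approx \sqrt{p(1-p)}, \qquad \frac{\sigma}{\mu} \approx \sqrt{\frac{1-p}{p}} \approx 5.7.$$

Under this approximation the ratio is greater than 1 whenever $p < 1/2$, so a rare event always has a standard deviation larger than its mean.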

The interval $0.03 \pm 0.17$ for 'possible' values may be appropriate for a normal random variable, but not for this particular hypergeometric random variable. What you seem to consider a surprisingly large standard deviation is due to the skewness of the distribution and its right tail.
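
The skewness can be quantified with the standardized third moment, and it is also easy to check which attainable values actually fall inside the interval mean ± SD. This is a sketch under the same parameters as above; `skew` and `inside` are illustrative names.

```r
# Exact distribution, mean and SD
k <- 0:3
pdf <- dhyper(k, 100, 10000 - 100, 3)
mu <- sum(k * pdf)
sigma <- sqrt(sum((k - mu)^2 * pdf))

# Standardized third moment (skewness); it is 0 for a symmetric distribution
skew <- sum((k - mu)^3 * pdf) / sigma^3
skew

# Attainable values inside mu +/- sigma, and their total probability
inside <- k >= mu - sigma & k <= mu + sigma
k[inside]
sum(pdf[inside])
```

The only attainable value inside the interval is 0, so the negative part of the interval carries no probability at all; the SD is inflated by the rare outcomes to the right.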