What is CDF - Cumulative distribution function?

If you have a quantity $X$ that takes some value at random, the cumulative distribution function (CDF) $F(x)$ gives the probability that $X$ is less than or equal to $x$, that is:

$$F(x) = P(X\leq x)$$

So you know several things: that $F(x)$ is bounded below by 0, and bounded above by 1 (because it doesn't make sense to have a probability outside [0,1]) and that it has to be non-decreasing in $x$.
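These properties can be checked numerically. The following is a minimal sketch, assuming a made-up sample of 1,000 normally distributed values; the helper `empirical_cdf` and the sample itself are illustrative and not part of the original question.

```python
import random

def empirical_cdf(sample, x):
    """Fraction of observations in `sample` that are <= x."""
    return sum(1 for value in sample if value <= x) / len(sample)

random.seed(0)  # fixed seed so the sketch is reproducible
# Hypothetical data: 1,000 draws from a normal distribution (mean 170, sd 10).
sample = [random.gauss(170, 10) for _ in range(1000)]

# Evaluate the empirical CDF on an increasing grid of points.
grid = [130, 150, 160, 170, 180, 190, 210]
values = [empirical_cdf(sample, x) for x in grid]

for x, f in zip(grid, values):
    print(f"F({x}) = {f:.3f}")

# Every value is a proportion, so it lies in [0, 1], and the values
# never decrease as x grows.
print(all(0.0 <= v <= 1.0 for v in values))
print(values == sorted(values))
```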

For example, if $X$ is the height of a person selected at random then $F(x)$ is the chance that the person will be shorter than $x$. If $F(\textrm{180 cm}) = 0.8$ then there is an 80% chance that a person selected at random will be shorter than 180 cm (equivalently, a 20% chance that they will be taller than 180 cm).
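As a rough illustration of reading a probability off a CDF, here is a sketch that models height with a normal distribution. The mean of 170 cm and standard deviation of 12 cm are assumptions, picked only so that $F(\textrm{180 cm})$ comes out close to 0.8; real height data would give different numbers.

```python
from statistics import NormalDist

# Hypothetical height model in cm (mean and sd are illustrative assumptions).
heights = NormalDist(mu=170, sigma=12)

p_shorter = heights.cdf(180)   # F(180) = P(X <= 180)
p_taller = 1 - p_shorter       # P(X > 180), the complement

print(f"P(height <= 180 cm) = {p_shorter:.3f}")
print(f"P(height >  180 cm) = {p_taller:.3f}")
```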

A real-life example comes from finance. One way of measuring the risk of a portfolio (of stocks, for example) is to calculate the 5% daily value-at-risk, or VAR. To say that the 5% daily VAR is $x$ means you expect your loss to be worse than $x$ dollars on only 5% of days. For example, you might report that the 5% daily VAR is \$60,000, meaning that you expect to lose more than \$60,000 on 5% of days, and on the other 95% your loss will be less than \$60,000 (ideally, you will be in profit!).

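To calculate the 5% VAR we need to know the cumulative distribution function of our losses. If the cumulative distribution function of daily losses is $F$, then a loss worse than $y$ occurs with probability $1 - F(y)$, so the 5% daily VAR is the value of $y$ that solves the equation

$$F(y) = 0.95$$

(Equivalently, if $G$ is the cumulative distribution function of daily profit, the 5% daily VAR is the $y$ that solves $G(-y) = 0.05$.)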

The reporting of daily VAR is a requirement in financial institutions worldwide, so this certainly satisfies your requirements of a 'real-life' application!
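Here is a minimal sketch of that calculation on simulated data. With losses recorded as positive numbers, the level exceeded on only 5% of days is the point where the CDF of losses reaches 0.95, so it can be read off as an empirical quantile. The loss distribution, its parameters, and the helper `empirical_quantile` are assumptions chosen only so the numbers are of a similar order to the \$60,000 example; this is not how any particular institution computes VAR.

```python
import math
import random

random.seed(1)  # fixed seed so the sketch is reproducible
# Hypothetical daily losses in dollars (positive = loss, negative = profit).
losses = [random.gauss(-5_000, 40_000) for _ in range(10_000)]

def empirical_quantile(data, p):
    """Smallest observed value y whose empirical CDF F(y) is at least p."""
    ordered = sorted(data)
    index = max(0, math.ceil(p * len(ordered)) - 1)
    return ordered[index]

# 5% daily VAR: the loss level that is exceeded on only 5% of days.
var_5pct = empirical_quantile(losses, 0.95)

# Sanity check: share of simulated days with a loss worse than the VAR.
share_worse = sum(1 for loss in losses if loss > var_5pct) / len(losses)

print(f"5% daily VAR: {var_5pct:,.0f} dollars")
print(f"Share of days with a worse loss: {share_worse:.3f}")
```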


It may be worthwhile to note why one would be interested in the notion of a CDF. The CDF is just one way to describe the distribution of a random variable, and one reason for preferring it is that it works for both continuous and discrete variables.

Taking the example from another answer here about people's heights, say you model heights in terms of real numbers (I would say hardly a very controversial notion). While it seems reasonable to assume that you can ascribe a non-zero probability to a person being shorter than some specified height, it may not be reasonable to assume a non-zero probability of the person being exactly that height.

Unlike such a model of height (which you would call continuous), another (canonical) example would be a die roll. Here (most people would agree) you can ascribe a non-zero probability to the die landing on exactly some given value. You would then say the die roll has a discrete distribution.

The distribution of the latter example can be described by the probabilities of individual (atomic) events, whereas the former case needs the notion of a probability density function. The interesting fact, now, is that both discrete and continuous distributions can be described by their CDF.
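A small sketch may make the contrast concrete. It assumes a fair six-sided die for the discrete case and, for the continuous case, a hypothetical normal model of height (mean 170 cm, sd 12 cm); both choices are illustrative. In each case the probability of an exact value is recovered from the CDF alone: it is the size of the jump at that point, which is positive for the die and vanishes for the height model.

```python
from fractions import Fraction
from statistics import NormalDist

def die_cdf(x):
    """CDF of a fair six-sided die: P(roll <= x)."""
    faces_at_or_below = sum(1 for face in range(1, 7) if face <= x)
    return Fraction(faces_at_or_below, 6)

# Discrete case: the CDF is a step function, and P(X = 3) is the height
# of the step at 3 (the CDF at 3 minus the CDF just below 3).
p_exactly_3 = die_cdf(3) - die_cdf(2.999)

# Continuous case (hypothetical height model): the CDF has no jumps, so the
# probability of a tiny interval around 180 cm shrinks towards zero.
heights = NormalDist(mu=170, sigma=12)
p_near_180 = heights.cdf(180 + 1e-9) - heights.cdf(180 - 1e-9)

print(f"P(die == 3) = {p_exactly_3}")
print(f"P(height within 1e-9 cm of 180) = {p_near_180:.2e}")
```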

I realise this is not completely formally correct or complete, but I hope it gives an idea of why the CDF might be an interesting way of describing distributions of random variables.


The CDF measures how much probability has accumulated up to a given value of the variable. It may help to look at this plot example. The CDFs are the black and blue lines, whereas the survival function (1 - CDF) is the orange line. The likelihood of finding 200 mm of rainfall is related to a probability distribution. However, we can note that the amount of rainfall increases alongside the cumulative probability. If we move from a likelihood of 10% to 20%, the amount of rainfall does not reset to zero. The CDF is at 100% when the variable has reached its maximum, so there is nothing left to accumulate. When the CDF is at 100% (or 1), the survival function is at 0%. Note that the survival function and CDF intersect around 200 mm, which is the point where both equal 50%: about half of the observations have at most that much rainfall and half have more.
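The relationship between the two curves can be reproduced on simulated data. The sketch below assumes a made-up set of rainfall totals drawn from a gamma distribution (the parameters are arbitrary and only chosen to give values of a few hundred mm); it is not the dataset behind the plot. It shows that the CDF and survival function always sum to 1 and that they cross at the median.

```python
import random
import statistics

random.seed(2)  # fixed seed so the sketch is reproducible
# Hypothetical rainfall totals in mm (gamma-distributed, purely illustrative).
rainfall = [random.gammavariate(4.0, 55.0) for _ in range(5000)]

def cdf(x):
    """Empirical CDF: share of observations with rainfall <= x."""
    return sum(1 for r in rainfall if r <= x) / len(rainfall)

def survival(x):
    """Survival function: share of observations with rainfall > x."""
    return 1.0 - cdf(x)

# The two curves cross where both equal 0.5, which is the median.
median_mm = statistics.median(rainfall)

print(f"median rainfall: {median_mm:.1f} mm")
print(f"CDF at median: {cdf(median_mm):.3f}")
print(f"survival at median: {survival(median_mm):.3f}")
print(f"CDF at maximum: {cdf(max(rainfall)):.1f}")
```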

EDIT: For my sample dataset of a normal distribution with an average of 6.019, this is what the CDF and survival functions look like. As the peak of a normal density sits at the average, one expects the CDF to level off after the peak (i.e., increase at a slower rate after the peak).
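The levelling-off can be seen by looking at how much the CDF gains over equal-width steps on either side of the average. This sketch uses the average of 6.019 from above; the standard deviation of 1.0 is an assumption, since the spread of the sample dataset is not given.

```python
from statistics import NormalDist

# Mean taken from the text; sigma = 1.0 is an assumed value for illustration.
dist = NormalDist(mu=6.019, sigma=1.0)

# Equally spaced points from 2 sd below the mean to 2 sd above it.
step = 0.5
points = [dist.mean + k * step for k in range(-4, 5)]

# Gain in cumulative probability over each step.
increments = [dist.cdf(b) - dist.cdf(a) for a, b in zip(points, points[1:])]

for a, b, gain in zip(points, points[1:], increments):
    print(f"[{a:.3f}, {b:.3f}]: CDF gains {gain:.4f}")

# The gains grow up to the mean and shrink after it, and the CDF at the
# mean itself is one half.
print(f"CDF at the mean: {dist.cdf(dist.mean):.3f}")
```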