What exactly is a probability measure in simple words?

Can someone explain probability measure in simple words? This term has been haunting me for a long time.

Today I came across the Kullback-Leibler divergence. The KL divergence between probability measures $P$ and $Q$ is defined by

$$KL(P,Q)= \begin{cases} \int \log\left(\frac{dP} {dQ}\right)dP & \text{if}\ P\ll Q, \\ \infty & \text{otherwise}. \end{cases}$$

I have no idea what I just read. I looked up probability measure, and it refers to probability space. I looked that up, and it refers to $\sigma$-algebra. I told myself I had to stop.

So, is a probability measure just a probability density, only said in a broader and fancier way? Am I overlooking a simple concept, or is this topic just that hard?


Solution 1:

A probability space consists of:

  1. A sample space $X$, which is the set of all possible outcomes of an experiment
  2. A collection of events $\Sigma$, which are subsets of $X$
  3. A function $\mu$, called a probability measure, that assigns to each event in $\Sigma$ a nonnegative real number

Let's consider the simple example of flipping a coin. In that case, we have $X=\{H,T\}$ for heads and tails respectively, $\Sigma=\{\varnothing,\{H\},\{T\},X\}$, and $\mu(\varnothing)=0$, $\mu(\{H\})=\mu(\{T\})=\frac{1}{2},$ and $\mu(X)=1$. All of this is a fancy way of saying that when I flip a coin, I have a $0$ percent chance of flipping nothing, a $50$ percent chance of flipping heads, a $50$ percent chance of flipping tails, and a $100$ percent chance of flipping something, heads or tails. This is all very intuitive.
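If it helps to see this as plain data, here is a minimal Python sketch of that same probability space (the variable names are just illustrative, not part of any standard library):

```python
# A minimal encoding of the coin-flip probability space (X, Sigma, mu)
X = frozenset({"H", "T"})
Sigma = [frozenset(), frozenset({"H"}), frozenset({"T"}), X]

# The probability measure: one nonnegative number per event
mu = {
    frozenset(): 0.0,
    frozenset({"H"}): 0.5,
    frozenset({"T"}): 0.5,
    X: 1.0,
}

# "Flipping something" is certain, and the two disjoint outcomes add up to it
assert mu[X] == 1.0
assert mu[frozenset({"H"})] + mu[frozenset({"T"})] == mu[X]
```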

Now, getting back to the abstract definition, there are certain natural requirements that $\Sigma$ and $\mu$ must satisfy. For example, it is natural to require that $\varnothing$ and $X$ are elements of $\Sigma$, and that $\mu(\varnothing)=0$ and $\mu(X)=1$. This is just saying that when performing an experiment, the probability that no outcome occurs is $0$, while the probability that some outcome occurs is $1$.

Similarly, it is natural to require that $\Sigma$ is closed under complements, and if $E\in\Sigma$ is an event, then $\mu(E^c)+\mu(E)=1$. This is just saying that when performing an experiment, the probability that event $E$ occurs or doesn't occur must be $1$.
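These two requirements can be checked mechanically on the coin-flip example. A small self-contained sketch (again illustrative Python, restating the same space as above):

```python
# The coin-flip space again, written out so this snippet runs on its own
X = frozenset({"H", "T"})
Sigma = [frozenset(), frozenset({"H"}), frozenset({"T"}), X]
mu = {frozenset(): 0.0, frozenset({"H"}): 0.5, frozenset({"T"}): 0.5, X: 1.0}

# The empty event and the whole space belong to Sigma, with measures 0 and 1
assert frozenset() in Sigma and X in Sigma
assert mu[frozenset()] == 0.0 and mu[X] == 1.0

# Sigma is closed under complements, and mu(E) + mu(E^c) = 1 for every event E
for E in Sigma:
    E_complement = X - E
    assert E_complement in Sigma
    assert mu[E] + mu[E_complement] == 1.0
```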

There are other requirements of $\Sigma$ which make it a $\sigma$-algebra, and other requirements of $\mu$ which make it a (finite) measure, and to rigorously study probability, one must eventually become familiar with these notions.

Solution 2:

To describe a random variable $X$, we specify what the probability is that the outcome of $X$ is some value $x$. For example, with a fair die and $X$ standing for "the score of one roll of the die", we'd say $$P(X=1)=P(X=2)=P(X=3)=P(X=4)=P(X=5)=P(X=6)=\frac16$$ and that's it. Our $X$ takes values only from the finite set $\Omega=\{1,2,3,4,5,6\}$.
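As a small illustrative sketch (Python, with names of my own choosing), the whole description of $X$ is just this table of probabilities, and sampling only needs that table:

```python
import random

# The fair die: the whole description of X is this table of probabilities
pmf_X = {k: 1 / 6 for k in range(1, 7)}

# The six probabilities add up to 1
assert abs(sum(pmf_X.values()) - 1.0) < 1e-12

# Drawing ten rolls according to that table
rolls = random.choices(list(pmf_X), weights=list(pmf_X.values()), k=10)
print(rolls)
```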

There are also random variables with (countably) infinitely many possible outcomes. For example, if $Y$ stands for "the number of throws of a fair coin until heads appears for the first time", then $$P(Y=1)=\frac12, P(Y=2)=\frac14, P(Y=3)=\frac18,\ldots $$ The set $\Omega$ of possible outcomes is now $\Omega=\mathbb N$.
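A quick sketch of $Y$ (Python, illustrative names): its probabilities can only ever be listed up to some cutoff, but the partial sums approach $1$, and a sample is produced by literally flipping until heads shows up:

```python
import random

# P(Y = k) = (1/2)^k for k = 1, 2, 3, ...
def p_Y(k):
    return 0.5 ** k

# The partial sums of these probabilities approach 1 as the cutoff grows
print(sum(p_Y(k) for k in range(1, 31)))   # ~ 0.999999999

# Sampling Y: flip a fair coin until heads appears, and count the flips
def sample_Y():
    flips = 1
    while random.random() < 0.5:   # "tails", so flip again
        flips += 1
    return flips

print([sample_Y() for _ in range(10)])
```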

And finally there are random variables with uncountably many possible outcomes (e.g. let $Z$ stand for "select a random point uniformly from the unit interval $\Omega:=[0,1]$"). In these cases, for any individual value $x\in\Omega$ the probability $P(Z=x)$ is usually simply zero. Instead, we get positive probabilities only when we ask about certain infinite subsets of the space $\Omega$ of possible outcomes. For example, we can legitimately say $P(\frac12< Z<\frac23)=\frac16$.

It would be nice if one could assign a probability value to every subset $S\subseteq \Omega$. However, it usually turns out that this is not possible in a consistent, well-defined manner. One still strives to make the collection of sets $S$ for which $P(Z\in S)$ is defined (or definable) as large as possible. For our example $Z$, we can certainly say $P(Z\in S)=b-a$ if $S$ is an interval $[a,b]$ or $]a,b[$ or $]a,b]$ or $[a,b[$ with $0\le a\le b\le 1$. In particular, $P(Z\in\emptyset)=0$ and $P(Z\in\Omega)=1$. Also, if $A,B$ are disjoint and $P(X\in A)$ and $P(X\in B)$ make sense, then so does $P(X\in A\cup B)$, namely with the value $P(X\in A\cup B)=P(X\in A)+P(X\in B)$. In fact, if we have pairwise disjoint sets $A_1,A_2,\ldots$ and know $P(X\in A_n)$ for each $n$, then it turns out to be advisable to have $$P\left(X\in\bigcup_{n=1}^\infty A_n\right)=\sum_{n=1}^\infty P(X\in A_n).$$

This is almost the concept of a $\sigma$-algebra: it is a collection of subsets of a given set $\Omega$. If we are lucky, as in the finite case or the countable case (at least as it occurred with the random variable $Y$ we defined above), this collection is the full power set of $\Omega$, but it may be smaller. At any rate, it is large enough to be closed under certain operations, among them countable unions of sets. And this is precisely what allows us to formulate the essential properties we want the probabilities of a random variable landing in a subset of $\Omega$ to have. Any function that assigns to each element of a given $\sigma$-algebra (i.e. to each sufficiently nice subset of $\Omega$) a value between $0$ and $1$ inclusive, such that the basic rules spelled out above hold for countable unions, complements, and the whole space, is then called a probability measure.
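To make the uniform example tangible, here is a rough Monte Carlo sketch (Python, with interval endpoints of my own choosing): single points get probability zero, intervals get probability equal to their length, and disjoint sets add.

```python
import random

N = 1_000_000
samples = [random.random() for _ in range(N)]   # draws of Z, uniform on [0, 1]

# P(1/2 < Z < 2/3) should be about 2/3 - 1/2 = 1/6 ≈ 0.1667
print(sum(1 for z in samples if 0.5 < z < 2 / 3) / N)

# Additivity on disjoint intervals: P(Z in [0, 1/4) or Z in [1/2, 3/4)) ≈ 1/4 + 1/4
print(sum(1 for z in samples if z < 0.25 or 0.5 <= z < 0.75) / N)

# Any single point is essentially never hit: P(Z = 1/2) = 0
print(sum(1 for z in samples if z == 0.5) / N)
```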

One important measure is the Lebesgue measure $\lambda$ on $[0,1]$ (which describes the random variable $Z$ above). You may know it from integration theory, where it allows us to generalize (extend) Riemann integration. You may know, for example, that the expected value of a finite random variable is simply given by $$\tag1E(X) = \sum_{x\in\Omega}x\cdot P(X=x) $$ or, more generally, that the expected value of a function of $X$ is $$\tag2E(f(X)) = \sum_{x\in\Omega}f(x)\cdot P(X=x).$$ These are just finite sums (hence they always work) if $X$ is a finite random variable. If $\Omega$ is countable, we can use the same formulas, but with series instead of sums, and it may happen that a series does not converge. For example, $E(Y)=2$, but $E((-2)^Y)$ does not converge. It becomes even worse when $P(X=x)=0$ for all $x\in\Omega$, because then the sums/series above simply result in $0$. In that case the sums/series are replaced with corresponding integrals: $$E(Z)=\int_0^1 x\,\mathrm dx =\frac12, \qquad E(f(Z))=\int_0^1 f(x)\,\mathrm dx.$$ Again, the second integral does not make sense for every possible $f$; it must be integrable.
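If it helps, formulas $(1)$ and $(2)$ and the integrals above can be checked numerically. A small sketch (Python, approximating the series and the integral by truncation):

```python
# Finite case, formula (1): E(X) for the fair die
pmf_X = {k: 1 / 6 for k in range(1, 7)}
print(sum(x * p for x, p in pmf_X.items()))          # 3.5

# Countable case: E(Y), truncating the series sum of k * (1/2)^k
print(sum(k * 0.5 ** k for k in range(1, 200)))      # very close to 2

# Uncountable case: E(Z) = integral of x dx over [0, 1], via a Riemann sum
n = 100_000
print(sum(i / n for i in range(n)) / n)              # close to 0.5
```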

The step from sums to (first series and then) integrals may look arbitrary, but it is well-founded in measure theory; often enough one goes in the other direction and writes sums and series as integrals (with respect to specific measures).

All this may still not be enough to grasp the formula you posted, but it should help you get started with the introductory texts you already tried to read.
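One last pointer back to the formula you posted: when $P$ and $Q$ are discrete, $P\ll Q$ just means that $Q(x)=0$ forces $P(x)=0$, and $\frac{dP}{dQ}$ is simply the ratio of the two probability mass functions, so the integral becomes a sum. Here is a hedged Python sketch, with a fair die for $P$ and a loaded die of my own invention for $Q$:

```python
import math

P = {k: 1 / 6 for k in range(1, 7)}                              # fair die
Q = {1: 0.25, 2: 0.25, 3: 0.125, 4: 0.125, 5: 0.125, 6: 0.125}   # loaded die

def kl(P, Q):
    """KL(P, Q) = sum over x of P(x) * log(P(x) / Q(x)), or infinity if P is
    not absolutely continuous w.r.t. Q (i.e. Q(x) = 0 while P(x) > 0)."""
    total = 0.0
    for x, p in P.items():
        if p == 0:
            continue              # outcomes P never produces contribute nothing
        if Q.get(x, 0) == 0:
            return math.inf       # P is not << Q, so KL(P, Q) = infinity
        total += p * math.log(p / Q[x])
    return total

print(kl(P, Q))                               # a small positive number (~0.057)
print(kl(P, {k: 0.2 for k in range(1, 6)}))   # inf: this Q gives no mass to 6
```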