What is the motivation for Measure Theory when there is Probability Theory?

In my undergraduate studies, probability was taught to me starting from Probability Theory. However, when I moved on to higher-level studies, probability was taught using Measure Theory. Both end up teaching distributions and expectations, so why is there a different approach?

I am really interested to know the reason for using Measure Theory instead of Probability Theory.

Sorry if it is a noob question.


I think the situation is similar to that in algebra. In elementary school, you learned that $1+1=2$. It was kinda obvious, right? In rigorous advanced algebra, however, you first have to define “$1$”, “$2$”, “$+$” and then you must prove that $1+1=2$.

Similarly, probability theory at the undergraduate level uses some informal but intuitively sound notions when introducing the basics. How those foundations are built is largely left unsaid, presumably because the focus at this level ought to be on more interesting topics that rely on these basics: combinatorics, distribution theory, statistics, practical applications, and so forth.

Only at a more advanced level do you realize that the foundations of probability theory are basically the same as those of measure theory, under the special assumption that the measure of the whole space is normalized to one. The constructions and results from measure theory help you build a rigorous and consistent theory about what events and probabilities really are. The point is that at this higher level there are no loose ends left: the informal concepts that you were accustomed to during your undergraduate training (and accepted without much reservation, since they felt intuitively right) are placed on rock-solid theoretical ground.


Measure theory gives a unified mathematical and conceptual framework for general probability theory.

The two classic scenarios in probability theory are the discrete and continuous cases, which are usually treated quite separately:

In the discrete case, we have a finite or countable space $\Omega$ and assign a probability $p(x)$ to each point $x \in \Omega$, so the probability of any event (subset) $A$ is just $\sum_{x \in A} p(x)$.

In the continuous case, we have a subset $\Omega$ of Euclidean space with a probability density function $\rho : \Omega \to \mathbb R$. The probability of an event $A$ is taken to be $\int_A \rho(x) d x$.

Measure-theoretic probability theory is a way to unify and generalise these two situations: we now let the probability space be given by a measure space $(\Omega,\Sigma,\mu)$, and take the probability of an event $A \in \Sigma$ to be $\int_A d\mu$. By taking $\mu(\{x\}) = p(x)$ we recover the discrete case, and by taking $\mu = \rho \lambda$ ($\lambda$ the usual Lebesgue measure) we recover the continuous case. Furthermore, we can use more general measures, e.g. the sum of a continuous distribution and a point mass.
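To make the unification concrete, here is a small Python sketch (the function names and the particular distributions are my own choices, not anything canonical): the discrete case computes $P(A)$ as a plain sum, the continuous case approximates $\int_A \rho(x)\,dx$ numerically, and a mixed measure — half a point mass at $0$ plus half an exponential density — is handled by the same "measure of a set" idea without any special casing.

```python
import math

# Discrete case: Omega = {0, 1, 2, ...} with p(x) = (1/2)^(x+1)
# (a geometric distribution). P(A) is just a sum over the points of A.
def p(x):
    return 0.5 ** (x + 1)

def prob_discrete(A):
    return sum(p(x) for x in A)

# Continuous case: Omega = [0, inf) with density rho(x) = e^{-x}
# (an Exponential(1) distribution). P([a, b]) is an integral of rho,
# approximated here with a simple midpoint rule.
def rho(x):
    return math.exp(-x)

def prob_continuous(a, b, n=100_000):
    h = (b - a) / n
    return h * sum(rho(a + (i + 0.5) * h) for i in range(n))

# Mixed measure: mu = (1/2) * (point mass at 0) + (1/2) * (exponential part).
# Neither the pure-sum nor the pure-integral picture covers this alone,
# but as a measure of the interval [a, b] it is unproblematic.
def prob_mixed(a, b):
    atom = 0.5 if a <= 0 <= b else 0.0
    return atom + 0.5 * prob_continuous(max(a, 0.0), b)
```

For example, `prob_discrete({0, 1, 2})` gives $7/8$, `prob_continuous(0, 1)` is close to $1 - e^{-1}$, and `prob_mixed(0, 50)` is close to $1$, since the interval captures both the atom and essentially all of the continuous mass.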


Adding some flavor to the "felt intuitively right" point in triple_sec's answer: entry-level probability texts implicitly assume that adding the probabilities of infinitely many events works, without ever justifying it.

Ex.: The strong law of large numbers says that if you toss a fair coin infinitely many times, the fraction of heads converges to $1/2$ with probability $1$. But then one wonders: what is the probability of one particular infinite sequence of tosses? The all-heads sequence (HHHH...) doesn't really seem to fit in. Elementary probability struggles here; measure theory, with its $\sigma$-additivity, addresses it beautifully: every individual infinite sequence has probability $0$, and since there are uncountably many sequences, countable additivity produces no contradiction with the whole space having probability $1$.
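Both halves of this picture are easy to check numerically; here is a small sketch (function names are mine, for illustration only). The probability of any *specific* length-$n$ sequence of fair tosses is $2^{-n}$, which vanishes as $n \to \infty$, while a simulated long run of tosses shows the fraction of heads settling near $1/2$, as the strong law predicts.

```python
import random

# Probability of one specific sequence of n fair-coin tosses:
# each toss contributes a factor 1/2, so it is 2**(-n), which
# tends to 0 as n grows -- any fixed infinite sequence,
# including HHHH..., has probability zero.
def prob_of_specific_sequence(n):
    return 0.5 ** n

# Empirical illustration of the law of large numbers: the fraction
# of heads in a long simulated run of fair tosses is close to 1/2.
def fraction_of_heads(n, seed=0):
    rng = random.Random(seed)
    return sum(rng.randint(0, 1) for _ in range(n)) / n
```

For instance, `prob_of_specific_sequence(100)` is already below $10^{-30}$, while `fraction_of_heads(100_000)` lands within about $0.01$ of $0.5$.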