What are the foundations of probability and how are they dependent upon a $\sigma$-field?

Solution 1:

Probability when there are only finitely many outcomes is a matter of counting. There are $36$ possible results from a roll of two dice and $6$ of them sum to $7$ so the probability of a sum of $7$ is $6/36$. You've measured the size of the set of outcomes that you are interested in.
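In code, the finite case really is just counting. A minimal sketch in Python (my own illustration, not part of the original answer):

```python
# Counting sketch: enumerate all 36 equally likely rolls of two dice.
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))
favorable = [o for o in outcomes if sum(o) == 7]   # (1,6), (2,5), ..., (6,1)

print(len(favorable), "/", len(outcomes))          # 6 / 36
```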

It's harder to make rigorous sense of things when the set of possible results is infinite. What does it mean to choose two numbers at random in the interval $[1,6]$ and ask for their sum? Any particular pair, like $(1.3, \pi)$, will have probability $0$.

You deal with this problem by replacing counting with integration. Unfortunately, the integration you learn in first-year calculus ("Riemann integration") isn't powerful enough to derive all you need about probability. (It is enough to determine that the probability that your two rolls total exactly $7$ is $0$, and to find the probability that the sum is at least $7$.)
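Here is a Monte Carlo sketch (my own illustration, not from the answer) of that continuous example. The value $1/2$ for "at least $7$" comes from geometry: the region $x + y \ge 7$ inside the square $[1,6]^2$ has area $12.5$, half the square's area of $25$.

```python
# Monte Carlo sketch: draw two numbers uniformly from [1, 6] many times.
import random

N = 1_000_000
exactly_seven = 0    # the event "sum is exactly 7" has probability 0
at_least_seven = 0   # the event "sum is at least 7" has probability 1/2

for _ in range(N):
    s = random.uniform(1, 6) + random.uniform(1, 6)
    if s == 7.0:          # essentially never fires: a single point has measure 0
        exactly_seven += 1
    if s >= 7.0:
        at_least_seven += 1

print(exactly_seven / N)    # ~0.0
print(at_least_seven / N)   # ~0.5
```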

For the definitions and theorems of rigorous probability theory (those are the "foundations" you ask about) you need "Lebesgue integration". That requires first carefully specifying the sets whose probabilities you are going to ask about, and not every set is allowed: without that restriction you can't make the mathematics work the way you want, for technical reasons. The collection of sets whose probability you may ask about carries the name "$\sigma$-field" or "sigma-algebra". (It's not a field in the arithmetic sense.) The essential point is that it's closed under countable set operations; that's what the "$\sigma$" says. Your text may not provide a formal definition, and you may not need one for NLP applications.
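In case it helps, one standard formulation of that definition is short. A collection $\Sigma$ of subsets of a sample space $S$ is a $\sigma$-field on $S$ when

$$S \in \Sigma, \qquad E \in \Sigma \implies S \setminus E \in \Sigma, \qquad E_1, E_2, E_3, \ldots \in \Sigma \implies \bigcup_{i=1}^{\infty} E_i \in \Sigma.$$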

Solution 2:

To add to the answer by Ethan Bolker and flesh it out: probability functions are defined on sets, which represent events. An event is a set of outcomes, and we ask for the probability that whatever we are querying or observing falls into that set, e.g. the probability that the temperature at noon tomorrow will be in the range $[25, 30]$.

Every probability function $P$, which assigns a number to each event set $E$, itself a subset of the total set of possible outcomes, or sample space, $S$, is required to satisfy the following rules, called the Kolmogorov axioms. The reason is that they capture the most basic ways in which we intuitively expect probabilities to behave (a computational check follows the list):

  1. Rule 1: There are no negative probabilities. That is, for every event $E$ we have $P(E) \ge 0$. Probabilities formalize the idea of "how many chances in ..." there are for something to happen, and a negative number of chances makes no more sense than a negative number of apples: what would it mean to have $-3$ occurrences of something, or $-6$ apples in my hand?
  2. Rule 2: The probability of the entire sample space is $1$, i.e. $P(S) = 1$. This should be intuitive: at least some outcome must occur, and $S$ is the set of all possible outcomes, so whatever outcome occurs is in it. Thus the event $S$ always occurs, no matter what.
  3. Rule 3: Probabilities of mutually exclusive events add. If we have a finite or countable sequence of mutually exclusive events $E_1, E_2, E_3, \ldots$, i.e. $E_i \cap E_j = \emptyset$ for every pair with $i \ne j$, then we must have

$$P(E_1 \cup E_2 \cup E_3 \cup \cdots) = \sum_{i=1}^{\infty} P(E_i)$$
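As a sanity check, here is a sketch (my own, not part of the original answer) that verifies the three axioms for the uniform probability function on the two-dice sample space; in the finite case, countable additivity reduces to finite additivity:

```python
# Sketch: the uniform probability function on the two-dice sample space,
# checked against the three Kolmogorov axioms (finite case, exact arithmetic).
from fractions import Fraction
from itertools import product

S = frozenset(product(range(1, 7), repeat=2))   # sample space: all 36 rolls

def P(E):
    """Uniform probability of an event E, a subset of S."""
    return Fraction(len(E), len(S))

A = frozenset(o for o in S if sum(o) == 7)      # "sum is exactly 7"
B = frozenset(o for o in S if sum(o) > 7)       # "sum exceeds 7", disjoint from A

assert P(A) >= 0 and P(B) >= 0                  # Rule 1: no negative probabilities
assert P(S) == 1                                # Rule 2: the whole space has probability 1
assert P(A | B) == P(A) + P(B)                  # Rule 3: additivity for disjoint events
print(P(A), P(B), P(A | B))                     # 1/6 5/12 7/12
```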

Now, as mentioned, we may not be able to assign every event a probability. For a discrete sample space, i.e. where $S$ is a finite or at most countably infinite set, this is doable. But for continuous sample spaces (e.g. $\mathbb{R}$), there are subtleties that make it impossible to define a useful probability function on every subset using convenient methods such as integration, and thus we must restrict the domain of $P$ from all subsets of $S$ to a selected collection of them, which we call the $\sigma$-field, usually denoted $\Sigma$. That is, $\mathrm{dom}(P) = \Sigma \subseteq 2^S$, and we are thus not allowed to consider events $E \notin \Sigma$. The definition of a $\sigma$-field is exactly what is required so that every set appearing in the axioms above is a legitimate input to $P$. Concretely, this means we must have the following (a worked example in code follows the list):

  1. Because of rule 2, in order for us to have $P(S) = 1$ we need $S$ to be in the domain of $P$ in the first place, so we must have $S \in \Sigma$.
  2. While this second stipulation is not strictly required to make the above definition valid, we typically require that the complement $\bar{E} = S \setminus E$ of any event also be in $\Sigma$. Very often we are interested in the probability of something NOT happening (e.g. the probability that a given number of people do NOT get better with some medical treatment we are testing), and for that question to make sense the corresponding event must be an available input to our probability function $P$.
  3. Finally, so that rule 3 can make sense, given any countable sequence of members $E_1, E_2, E_3, \cdots \in \Sigma$ we must have $(E_1 \cup E_2 \cup E_3 \cup \cdots) \in \Sigma$.
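As a concrete illustration (a sketch of my own; the helper `is_sigma_field` is hypothetical, not a library function): on a finite sample space, the countable union in condition 3 reduces to finite unions, so all three conditions can be checked mechanically. Note also that $\emptyset \in \Sigma$ comes for free, since $\emptyset = S \setminus S$.

```python
# Sketch: check the three sigma-field conditions on a finite sample space.
# Pairwise closure under unions implies closure under all finite unions,
# and on a finite S every countable union is a finite union.
from itertools import combinations

def is_sigma_field(S, F):
    S = frozenset(S)
    has_S = S in F                                                 # condition 1
    closed_complement = all(S - E in F for E in F)                 # condition 2
    closed_union = all(A | B in F for A, B in combinations(F, 2))  # condition 3
    return has_S and closed_complement and closed_union

S = frozenset(range(1, 7))
E = frozenset({2, 4, 6})                        # the event "the roll is even"
sigma = {frozenset(), E, S - E, S}              # smallest sigma-field containing E
print(is_sigma_field(S, sigma))                 # True
print(is_sigma_field(S, {frozenset(), E, S}))   # False: the complement of E is missing
```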

And that's about it.