Intuition for probability density function as a Radon-Nikodym derivative

Your understanding of the basic math itself seems pretty solid, so I'll just try to provide some extra intuition.

When we integrate a function $g$ with respect to the Lebesgue measure $\lambda$, we find its "area under the curve" or "volume under the surface", and so on. This follows because the Lebesgue measure assigns the ordinary notion of length (area, etc.) to every integration region in the domain of $g$. Therefore, I say that integrating with respect to the Lebesgue measure (which agrees in value with Riemann integration whenever the Riemann integral exists) is a calculation to find the "volume" of some function.

Let's pretend for a moment that when performing integration, we are always forced to do it over the entire domain of the integrand. That is, we are only allowed to compute $$\int_B g \,d\lambda\ \ \ \ \text{if}\ \ \ \ B=\mathbb{R}^n$$ where $\mathbb{R}^n$ is assumed to be the entire domain of $g$.

With that restriction, what could we do if we only cared about the volume of $g$ over the region $B$? Well, we could define an indicator function for the set $B$ and integrate its product with $g$, $$\int_{\mathbb{R}^n} \mathbf{1}_B g \,d\lambda$$

When we do something like this, we are taking the mindset that our goal is to nullify $g$ wherever we don't care about it, but that isn't the only way to think about it. We can instead try to nullify $\mathbb{R}^n$ itself wherever we don't care about it. We would then compute the integral as $$\int_{\mathbb{R}^n} g \,d\mu$$ where $\mu$ is a measure that behaves just like $\lambda$ for Borel sets that are subsets of $B$, but returns zero for Borel sets that have no intersection with $B$; that is, $\mu(A) = \lambda(A \cap B)$ for every Borel set $A$. Using this measure, it doesn't matter that $g$ takes nonzero values outside of $B$, because $\mu$ gives that part of the domain no weight.

These integrals are just different ways to think about the same thing, $$\int_{\mathbb{R}^n} g \,d\mu = \int_{\mathbb{R}^n} \mathbf{1}_B g \,d\lambda$$ The function $\mathbf{1}_B$ is the density of $\mu$, that is, its Radon–Nikodym derivative with respect to the Lebesgue measure. Matching up symbols in the equation directly, $$d\mu = f\,d\lambda$$ where here $f = \mathbf{1}_B$. The reason for showing you all this was to show how we can think of changing measure as a way to tell an integral to compute only the volume we care about. Changing measure allows us to discount parts of the domain of $g$ instead of discounting parts of $g$ itself, and the Radon–Nikodym theorem formalizes their equivalence.
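If it helps to see the two viewpoints side by side numerically, here is a minimal one-dimensional sketch. Everything in it is an illustrative assumption rather than part of the argument above: $g(x) = x^2$, $B = [1, 2]$, the "entire domain" truncated to $[-5, 5]$, and a plain midpoint sum standing in for the integral. The names `lam`, `mu`, and `integrate` are made up for this example.

```python
B_LO, B_HI = 1.0, 2.0        # the region B we care about (assumed for illustration)
DOM_LO, DOM_HI = -5.0, 5.0   # truncated stand-in for the entire domain
N = 100_000                  # number of grid cells
WIDTH = (DOM_HI - DOM_LO) / N


def g(x):
    return x * x


def indicator_B(x):
    return 1.0 if B_LO <= x <= B_HI else 0.0


def lam(lo, hi):
    """Lebesgue measure of the cell [lo, hi]: its ordinary length."""
    return hi - lo


def mu(lo, hi):
    """mu of the cell [lo, hi]: the length of its overlap with B, i.e. lam(cell ∩ B)."""
    return max(0.0, min(hi, B_HI) - max(lo, B_LO))


def integrate(func, measure):
    """Midpoint sum over the whole grid: sum of func(midpoint) * measure(cell)."""
    total = 0.0
    for i in range(N):
        lo = DOM_LO + i * WIDTH
        hi = lo + WIDTH
        total += func(0.5 * (lo + hi)) * measure(lo, hi)
    return total


# Viewpoint 1: nullify the integrand outside B, integrate against lambda.
weighted_integrand = integrate(lambda x: indicator_B(x) * g(x), lam)

# Viewpoint 2: leave g alone, nullify the domain outside B via mu.
weighted_domain = integrate(g, mu)

# Closed form of the integral of x^2 over [1, 2], for comparison.
exact = (B_HI ** 3 - B_LO ** 3) / 3

print(weighted_integrand, weighted_domain, exact)
```

Both sums run over the whole grid; the only difference is whether the zeroing happens in the function being summed or in the weight each cell receives.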

The cool thing about this is that our measures don't have to be as all-or-nothing as the $\mu$ I constructed above. They don't have to completely ignore the domain outside $B$; they can instead just care about it less than the part inside $B$.

Think about how we might find the total mass of some physical object. We integrate over all of space (the entire domain where particles can exist) but use a measure $m$ that returns larger values for regions in space where there is "more mass" and smaller values (down to zero) for regions in space where there is "less mass". It doesn't have to be just mass vs no-mass, it can be everything in between too, and the Radon–Nikodym derivative of this measure is indeed the literal "density" of the object.

So what about probability? Just like with the mass example, we are encroaching on the world of physical modeling and leaving abstract mathematics. Formally, a measure is a probability measure if it returns 1 for the whole space, i.e. the Borel set that is the union of all the Borel sets. When we consider these Borel sets to model physical "events", this makes intuitive modeling sense: we are just defining the probability (measure) of anything at all happening to be 1.

But why 1? Arbitrary convenience. In fact, some people don't use 1! Some people use 100. Those people are said to use the "percent" convention. What is the probability that if I flip this coin, it lands on heads or tails? 100... percent. We could have used literally any positive real number, but 1 is just a nice choice. Note that the Lebesgue measure is not a probability measure because $\lambda(\mathbb{R}^n) = \infty$.

Anyway, what people are doing with probability is designing a measure that models how much significance they give to various events - which are Borel sets, which are regions in the domain; they are just defining how much they value parts of the domain itself. As we saw before with the measure $\mu$ I constructed, the easiest way to write down your measure is by writing its density.

Fun to note: "expected value" of $g$ is just its volume with respect to the given probability measure $P$, and "covariance" of $g$ with $h$ is just their inner product with respect to $P$. Letting $\Omega$ be the entire domain of both $g$ and $h$ (also known as the sample space) and $f$ be the density of $P$ with respect to the Lebesgue measure, if $g$ and $h$ have zero mean, $$\operatorname{cov}(g, h) = \int_{x \in \Omega}g(x)h(x)f(x)\ dx = \int_{\Omega}gh\ dP = \langle g, h \rangle_P$$
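Here is a rough numerical sketch of that covariance formula, under assumptions chosen purely for illustration: $P$ is the standard normal distribution (so $f$ is its pdf), $g(x) = x$, $h(x) = x^3$, and the sample space is truncated to $[-10, 10]$, where the neglected tails are tiny. The helper `integrate_dP` is a made-up name for a midpoint sum of $\text{func}(x)\,f(x)\,dx$.

```python
import math


def f(x):
    """Density of P w.r.t. Lebesgue measure: standard normal pdf (illustrative choice)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def integrate_dP(func, lo=-10.0, hi=10.0, n=200_000):
    """Approximate the integral of func dP by a midpoint sum of func(x) * f(x) * dx."""
    width = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * width
        total += func(x) * f(x)
    return total * width


def g(x):
    return x


def h(x):
    return x ** 3


total_mass = integrate_dP(lambda x: 1.0)  # P(Omega); should be close to 1
mean_g = integrate_dP(g)                  # expected value of g; should be close to 0
mean_h = integrate_dP(h)                  # expected value of h; should be close to 0

# With both means zero, the covariance is just the inner product <g, h>_P.
cov_gh = integrate_dP(lambda x: g(x) * h(x))

print(total_mass, mean_g, mean_h, cov_gh)
```

For this particular choice the inner product is the fourth moment of the standard normal, which is 3, so `cov_gh` should land very near that value.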

I'll let you show that the correlation coefficient for $g$ and $h$ is just the "cosine of the angle between them".

Hope this helps! Measure theory is definitely the modern way of viewing things, and people understood "weighted Riemann integrals" well before they realized the other viewpoint: "weighting" the domain instead of the integrand. Many people attribute this viewpoint's birth to Lebesgue integration, where the operation of integration was first (notably) restated in terms of an arbitrary measure, as opposed to Riemann integration, which tacitly always assumes the Lebesgue measure.

I noticed you brought up the normal distribution specifically. The normal distribution is special for a lot of reasons, but it is by no means some de-facto probability density. There are an infinite number of equally valid probability measures (with their associated densities). The normal distribution is really only so important because of the central limit theorem.

The case you are referring to is valid. In your example, the Radon-Nikodym derivative serves as a reweighting of the Lebesgue measure, and it turns out to be the pdf of the given distribution.

However, the Radon-Nikodym derivative is a more general concept. Your example converts the Lebesgue measure to a normal probability measure, whereas a Radon-Nikodym derivative can convert any measure to another as long as they meet certain technical conditions (chiefly, that the new measure is absolutely continuous with respect to the old one).

A quick recap of the intuition behind measures: a measure is a set function that takes a set as input and returns a non-negative number as output. Length, volume, weight, and probability are all examples of measures.

So what if I have one measure that returns length in meters and another that returns length in kilometers? A Radon-Nikodym derivative converts between the two. What is it in this case? The constant 1000: the derivative of the meter measure with respect to the kilometer measure is 1000 everywhere.

Similarly, another Radon-Nikodym derivative converts a measure that returns weight in kg into one that returns weight in lbs.
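As a worked version of the length example, write $\mu_{\mathrm{m}}$ and $\mu_{\mathrm{km}}$ (names introduced here just for illustration) for the measures that return length in meters and in kilometers. For every Borel set $A$, $$\mu_{\mathrm{m}}(A) = 1000\,\mu_{\mathrm{km}}(A) = \int_A 1000 \,d\mu_{\mathrm{km}}, \qquad\text{so}\qquad \frac{d\mu_{\mathrm{m}}}{d\mu_{\mathrm{km}}} = 1000, \qquad \frac{d\mu_{\mathrm{km}}}{d\mu_{\mathrm{m}}} = \frac{1}{1000}$$ The weight example follows the same pattern with a different constant. A pdf is the case where the "conversion factor" is allowed to vary from point to point instead of being constant.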

Back to your example: the pdf converts the Lebesgue measure into a normal probability measure, but this is just one use of a Radon-Nikodym derivative.

Starting from the Lebesgue measure, you can choose a Radon-Nikodym derivative that generates other useful measures (not necessarily probability measures).

Hope this clarifies it.
