Gaussian Mixture Model: What is a "universal approximator of densities"?

It's a vague/fancy way to say the following simple result.

But first...some definitions and facts!:

  1. Let $\mathcal{P}(\mathbb{R}^d)\subset \mathcal{M}_f(\mathbb{R}^d)$ be the set of probability (resp. finite signed) Borel measures on $\mathbb{R}^d$ (with its Euclidean topology).
  2. Equip $\mathcal{M}_f(\mathbb{R}^d)$ with the weak$^{\star}$ topology (somewhat confusingly known in statistics as the topology of weak convergence/convergence in distribution).
  3. The point-masses $\{\delta_x\}_{x \in \mathbb{R}^d}$ span a dense subset of $\mathcal{M}_f(\mathbb{R}^d)$.
    Note: This is basically why we can to empirical estimation of probability measures.
  4. In this topology, given any $x \in \mathbb{R}^d$, the sequence of Gaussians $(\nu_{x,n})_{n \in \mathbb{N}}$ with mean $x$ and variance $\frac1{2^n}$ converge to $\delta_x$.
    Note: Here just look at the characteristic funciton/ Fourier transforms.
  5. This means that the closed span of the Gaussian measures contains the closed span of the point-masses, which was $\mathcal{M}_f(\mathbb{R}^d)$.
  6. The projection of $\mathcal{M}_f(\mathbb{R}^d)-\{0\}\to \mathcal{P}(\mathbb{R}^d)$ taking a measure $\mu\mapsto \frac{1}{\mu_+(\mathbb{R}^d)}\mu_+$ is a continous surjection; whence it takes dense sets to dense sets. Moreover, a direct computation shows that it takes linear combinations to convex combinations.

Conlusion:

"Gaussian mixtures" (a.k.a. convex combinatoins of Gaussian measures) are dense in $\mathcal{P}(\mathbb{R}^d)$ For the weak$^{\star}$ topology!

Bonuses, fun, and excitment for all: Try breaking the argument and showing that linear combinations of Gaussians cannot approximate (non-trivial) point-masses for stronger topologies on $\mathcal{M}_f(\mathbb{R}^d)$. In other words....they aren't capable of approximating all probability measures for the total variation topology (another rather standard notion of convergence of measures).


If somebody stumbles across this question, here are some results I was able to find after all. See also this reddit thread.

  • Linear combinations of Gaussians are dense in L2, i.e., can be made to approximate any square-integrable function arbitrary closely, according to [1]. Kudos to /u/LeChatTerrible on reddit for finding this
  • I think that [2] with its "Approximation Theorem" pretty much shows what I need.

[1] https://arxiv.org/pdf/0805.3795.pdf

[2] Kostantinos N. Plataniotis and Dimitris Hatzinakos. 2000. Gaussian mixtures and their applications to signal processing. In Advanced Signal Processing Handbook: Theory and Implementation for Radar, Sonar, and Medical Imaging Real Time Systems, Stergios Stergiopoulos (Ed.). CRC Press, Boca Raton, Chapter 3.