The Kullback-Leibler Divergence: Proving That Zero Divergence Implies the Exact Same Distribution

The Kullback-Leibler divergence (KL divergence) is defined as follows, for distributions $P,Q$ over the outcome space $\mathcal X$: $$D_{\text{KL}}(P\parallel Q)=\mathbb E_{x\sim P}\left[\log\dfrac{P(x)}{Q(x)}\right]=\sum_{x\in \mathcal X} P(x)\log\left(\dfrac{P(x)}{Q(x)}\right).$$ I read somewhere that, with a little math (I assume throughout that $\mathcal X$ is the outcome space of a discrete random variable), $$D_{\text{KL}}(P\parallel Q)=0\Longrightarrow P,Q \text{ are the exact same distribution.}$$
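
For concreteness, here is a quick numerical sanity check of the definition and of the claim (a minimal sketch, not part of the question's argument; the helper name `kl_divergence` is my own, and strictly positive probability vectors are assumed so zero masses need no special care here):

```python
# A minimal numerical sketch of the definition (helper name is my own; strictly
# positive probability vectors are assumed, so zero masses need no care here).
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x))."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = [0.2, 0.5, 0.3]
print(kl_divergence(p, p))                # 0.0  -> same distribution
print(kl_divergence(p, [0.3, 0.4, 0.3]))  # > 0  -> different distributions
```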

I'm not sure I have stated the claim correctly, but perhaps it is this: for a discrete random variable $X$, where $P(X=x)$ and $Q(X=x)$ are the probabilities under the distributions $P$ and $Q$ respectively, we have for all $x\in\mathcal X$ $$P(X=x)=Q(X=x).$$ I tried to prove this claim, but I think I only managed it for rational probabilities. How can we prove it in all cases? I write out the case of $2$ possible outcomes, i.e. $\#\mathcal X=2$; the method extends to any finite number of outcomes.

To that end, let $P_1,P_2$ and $Q_1,Q_2$ be the probabilities, with $P_1+P_2=1=Q_1+Q_2$, so that $$D_{\text{KL}}(P\parallel Q)=P_1\log\dfrac{P_1}{Q_1}+P_2\log\dfrac{P_2}{Q_2}.$$

$$D_{\text{KL}}(P\parallel Q)=0\Longleftrightarrow\log\left[\left(\dfrac{P_1}{Q_1}\right)^{P_1}\left(\dfrac{P_2}{Q_2}\right)^{P_2}\right]=0\Longleftrightarrow P_1^{P_1}P_2^{P_2}=Q_1^{P_1}Q_2^{P_2}.$$ If we let $P_1=\frac{r}{q}$ with integers $0<r<q$, then raising both sides to the power $q$ and rearranging gives the leftmost equality below, while the AM-GM inequality applied to $r$ copies of $Q_1/r$ and $q-r$ copies of $Q_2/(q-r)$ gives the middle inequality:

$$\dfrac{1}{q^q}=\dfrac{Q_1^rQ_2^{q-r}}{r^r(q-r)^{q-r}}\le \left(\dfrac{\overbrace{\dfrac{Q_1}{r}+\dfrac{Q_1}{r}+\cdots+\dfrac{Q_1}{r}}^{r}+\overbrace{\dfrac{Q_2}{q-r}+\dfrac{Q_2}{q-r}+\cdots+\dfrac{Q_2}{q-r}}^{q-r}}{q}\right)^q=\dfrac{1}{q^q}.$$

Hence the AM-GM inequality holds with equality, so $Q_1/r=Q_2/(q-r)$; since $Q_1+Q_2=1$, this common value is $1/q$, giving $Q_1=r/q=P_1$ and $Q_2=P_2$. How can we show that this also holds for irrational probabilities? I don't know how to apply the inequality in such cases.
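
For what it's worth, here is a small numerical check (my own setup, not a proof) that the quantity $Q_1^{P_1}Q_2^{P_2}$ attains its maximum value $P_1^{P_1}P_2^{P_2}$ only at $Q_1=P_1$ even when $P_1$ is irrational, which is exactly the case the AM-GM argument above does not cover:

```python
# Grid check (not a proof): for an irrational P1, f(Q1) = Q1^P1 * (1-Q1)^(1-P1)
# is maximized at Q1 = P1, with maximum value P1^P1 * (1-P1)^(1-P1).
import numpy as np

p1 = 1 / np.sqrt(2)                          # an irrational probability
q1 = np.linspace(1e-6, 1 - 1e-6, 200_001)
f = q1**p1 * (1 - q1)**(1 - p1)

print("argmax of f :", q1[np.argmax(f)])     # ~ 0.70711 = P1
print("max of f    :", f.max())
print("P1^P1 P2^P2 :", p1**p1 * (1 - p1)**(1 - p1))
```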


Solution 1:

This is a special case of Gibbs' inequality, which asserts that the KL divergence between discrete distributions is always nonnegative, and is zero iff the distributions are the same.

To deal with zero masses, note that the KL divergence is only defined (finite) if the zeros of $Q$ are a subset of the zeros of $P$; further, we assign the value zero to the summand at any $x$ with $P(x)=0$. Thus, we can restrict the sum to $S\equiv \{x\in\mathcal{X}:P(x)>0\}.$
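
As a sketch of these conventions (the function name below is my own), one restricts the sum to $S$ and returns $+\infty$ whenever $Q$ vanishes at a point where $P$ does not:

```python
# Sketch of the zero-mass conventions above: drop summands with P(x) = 0, and
# return +inf when Q(x) = 0 at some x with P(x) > 0 (i.e. the zeros of Q are
# not contained in the zeros of P).
import numpy as np

def kl_with_support_conventions(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    S = p > 0
    if np.any(q[S] == 0):
        return np.inf
    return float(np.sum(p[S] * np.log(p[S] / q[S])))

print(kl_with_support_conventions([0.5, 0.5, 0.0], [0.5, 0.25, 0.25]))  # finite
print(kl_with_support_conventions([0.5, 0.5, 0.0], [1.0, 0.0, 0.0]))    # inf
```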

We show Gibbs' inequality by appealing to Jensen's inequality, using the concavity of $\log$:

$$D_{\text{KL}}(P\|Q)=-\mathbb{E}_{X\sim P}\left[\log\frac{Q(X)}{P(X)}\right]\geq -\log \mathbb{E}_{X\sim P}\left[\frac{Q(X)}{P(X)}\right]=-\log \sum_{x\in S}Q(x)\geq 0,$$

where the final inequality comes from $\sum_{x\in S}Q(x)\leq 1 $ (recall we dropped any $x:P(x)=0$).
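
Numerically, the chain can be illustrated as follows (the distributions $P,Q$ below are arbitrary examples of mine, with $P$ putting zero mass on the last outcome so that the final inequality is strict):

```python
# Illustration of  D_KL(P||Q) >= -log sum_{x in S} Q(x) >= 0  on one example.
import numpy as np

P = np.array([0.3, 0.7, 0.0])
Q = np.array([0.2, 0.6, 0.2])
S = P > 0

d_kl   = -np.sum(P[S] * np.log(Q[S] / P[S]))    # D_KL(P || Q)
jensen = -np.log(np.sum(P[S] * (Q[S] / P[S])))  # -log E_P[Q/P] = -log sum_S Q(x)

print(d_kl, jensen)   # ~0.2295 >= ~0.2231 >= 0, as claimed
```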

Jensen's inequality holds with equality iff the transformation is affine on some convex set occurring with probability 1; this subsumes the case where the random variable whose expectation is taken is a.s. constant.

Since $\log$ is strictly concave, it is affine only on singletons, so equality in the Jensen step requires $Q(X)/P(X)$ to be a.s. constant, say $Q(x)=cP(x)$ for $x\in S$; equality in the final step requires $\sum_{x\in S}Q(x)=1$, which forces $c=1$. The KL divergence is thus zero iff

  1. $Q(x)=P(x)\ \forall x\in S$
  2. $\sum_{x\in S}Q(x)= 1$.

Note that 2 forces $Q(x)=0$ for every $x\notin S$ (where $P(x)=0$ by definition of $S$), so 1 and 2 hold iff $Q(x)=P(x)\ \forall x\in\mathcal{X}.$