Relation between Shannon Entropy and Total Variation distance

Solution 1:

The second part below is rather inelegant, and I think it can probably be improved. Suggestions are welcome.


Note that the LHS is the Jensen-Shannon divergence ($\mathrm{JSD}$) between $P_1$ and $P_2$, and that $\mathrm{JSD}$ is an $f$-divergence. For $f$-divergences $D_f, D_g$ generated by convex functions $f$ and $g$, the joint ranges are defined as \begin{align} \mathbb{R}^2 \supset \mathcal{R} :=& \{ (D_f(P\|Q), D_g(P\|Q)): P, Q \textrm{ are distributions on some measurable space} \} \\ \mathcal{R}_k :=& \{ (D_f(P\|Q), D_g(P\|Q)): P, Q \textrm{ are distributions on } ([1:k], 2^{[1:k]} ) \}\end{align}
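In case a concrete handle on these two coordinates is useful, here is a minimal numerical sketch (not part of the argument; the function names and test vectors are mine) of $\mathrm{JSD}$ and $d_{TV}$ for finite distributions, using the convention $d_{TV} = \sup_A |P(A) - Q(A)| = \tfrac12\|P - Q\|_1$, which matches how $d_{TV}$ is evaluated in $(1)$ below:

```python
import math

def entropy(p):
    """Shannon entropy in nats, with the 0 * log(0) = 0 convention."""
    return -sum(x * math.log(x) for x in p if x > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: H((P+Q)/2) - (H(P) + H(Q)) / 2."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return entropy(m) - (entropy(p) + entropy(q)) / 2

def tv(p, q):
    """Total variation distance: sup_A |P(A) - Q(A)| = 0.5 * sum_i |p_i - q_i|."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

P, Q = [0.2, 0.3, 0.5], [0.5, 0.4, 0.1]
print(jsd(P, Q), tv(P, Q))
```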

A remarkable theorem of Harremoës & Vajda (see also these notes by Wu) states that for any pair of $f$-divergences, $$\mathcal{R} = \mathrm{co}(\mathcal{R}_2),$$ where $\mathrm{co}$ is the convex hull operator.

Now, we want to show the relation $\mathrm{JSD} \le h\left(\frac{1}{2}d_{TV}\right).$ Since both $\mathrm{JSD}$ and $d_{TV}$ are $f$-divergences, and since the set $\mathcal{S} := \{ (x,y) : y - h(x/2) \le 0\}$ (with $x$ the $d_{TV}$-coordinate and $y$ the $\mathrm{JSD}$-coordinate) is convex in $\mathbb{R}^2$ because $h$, and hence $x \mapsto h(x/2)$, is concave, it suffices to show this inequality for distributions on $2$ symbols: by this convexity we have $\mathcal{R}_2 \subset \mathcal{S} \implies \mathrm{co}(\mathcal{R}_2) \subset \mathcal{S},$ as the convex hull of a set is the intersection of all convex sets containing it. The remainder of this answer will thus concentrate on showing $\mathcal{R}_2 \subset \mathcal{S}$.


Let $p := \pi + \delta, q:= \pi - \delta,$ where $\delta \in [0,1/2]$ and $\pi \in [\delta, 1- \delta].$ Since $d_{TV}(\mathrm{Bern}(p),\mathrm{Bern}(q)) = |p - q| = 2\delta,$ we will show that $$ \mathrm{JSD}(\mathrm{Bern}(p)\|\mathrm{Bern}(q) ) \le h\left(\frac{1}{2}d_{TV}(\mathrm{Bern}(p),\mathrm{Bern}(q) )\right) = h(\delta), \tag{1}$$ which suffices to show the relation on $2$-symbol distributions. Note that above $p\ge q$ always, but this doesn't matter since both $\mathrm{JSD}$ and $d_{TV}$ are symmetric in their arguments.
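As a quick numerical spot check of $(1)$ (illustrative only; the grid and helper names are my own), one can sweep $(\pi, \delta)$, compute the $\mathrm{JSD}$ from the standard KL form $\tfrac12 KL(P\|M) + \tfrac12 KL(Q\|M)$ with $M$ the midpoint, and confirm it never exceeds $h(\delta)$:

```python
import math

def h(x):
    """Binary entropy in nats, with h(0) = h(1) = 0."""
    return 0.0 if x <= 0.0 or x >= 1.0 else -x * math.log(x) - (1 - x) * math.log(1 - x)

def kl_bern(a, b):
    """KL(Bern(a) || Bern(b)) in nats, with the 0 * log(0) = 0 convention."""
    out = 0.0
    if a > 0.0:
        out += a * math.log(a / b)
    if a < 1.0:
        out += (1 - a) * math.log((1 - a) / (1 - b))
    return out

def jsd_bern(p, q):
    """JSD(Bern(p), Bern(q)) = 0.5 * KL(P || M) + 0.5 * KL(Q || M), M the midpoint."""
    m = (p + q) / 2
    return 0.5 * kl_bern(p, m) + 0.5 * kl_bern(q, m)

worst = -float("inf")
for k in range(1, 51):
    delta = k / 100                               # delta in (0, 1/2]
    for j in range(101):
        pi = delta + j * (1 - 2 * delta) / 100    # pi in [delta, 1 - delta]
        worst = max(worst, jsd_bern(pi + delta, pi - delta) - h(delta))
print("max of JSD - h(delta) over the grid:", worst)  # expect <= 0 up to rounding
```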

For conciseness I'll denote the $\mathrm{JSD}$ above by $J$. All '$\log$'s in the following are natural, $h(p) := -p \log p - (1-p) \log (1-p)$ is the binary entropy, and we will make use of the simple identities, valid for $p \in (0,1)$: $$ \frac{\mathrm{d}}{\mathrm{d}p} h(p) = \log \frac{1-p}{p}, \\ \frac{\mathrm{d}^2}{\mathrm{d}p^2} h(p) = -\frac{1}{p} - \frac{1}{1-p}. $$
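These identities are standard, but if a quick verification is wanted, here is a small sympy sketch (purely a checking aid; the sample points are arbitrary choices of mine):

```python
import math
import sympy as sp

p = sp.symbols("p", positive=True)
h = -p * sp.log(p) - (1 - p) * sp.log(1 - p)   # binary entropy, natural log

d1 = sp.diff(h, p)       # claimed: log((1 - p) / p)
d2 = sp.diff(h, p, 2)    # claimed: -1/p - 1/(1 - p)

for val in (0.13, 0.5, 0.82):                  # arbitrary points in (0, 1)
    assert abs(float(d1.subs(p, val)) - math.log((1 - val) / val)) < 1e-12
    assert abs(float(d2.subs(p, val)) - (-1 / val - 1 / (1 - val))) < 1e-12
print("derivative identities agree at the sampled points")
```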

By the expansion in the question, and since the equal mixture of $\mathrm{Bern}(\pi + \delta)$ and $\mathrm{Bern}(\pi - \delta)$ is $\mathrm{Bern}(\pi),$ $$J(\pi, \delta) = h( \pi) - \frac{1}{2} h(\pi + \delta) - \frac{1}{2} h(\pi - \delta).$$

It is trivial to see that the relation $(1)$ holds if $\delta = 0$. Let us thus assume that $\delta > 0.$ For $\pi \in (\delta, 1-\delta),$ we have

\begin{align} \frac{\partial}{\partial \pi} J &= \log \frac{1-\pi}{\pi} - \frac{1}{2} \left( \log \frac{1 - \pi - \delta}{\pi + \delta} + \log \frac{1 - \pi +\delta}{\pi - \delta}\right) \end{align} and \begin{align} \frac{\partial^2}{\partial \pi^2} J &= \frac{1}{2} \left( \frac{1}{\pi + \delta} + \frac{1}{\pi - \delta} + \frac{1}{1 - \pi - \delta} + \frac{1}{1 - \pi + \delta} \right) - \frac{1}{\pi} - \frac{1}{1-\pi} \\ &= \frac{\pi}{\pi^2 - \delta^2} - \frac{1}{\pi} + \frac{1 - \pi}{( 1-\pi)^2 - \delta^2} - \frac{1}{1-\pi} \\ &= \frac{\delta^2}{\pi(\pi^2 - \delta^2)} + \frac{\delta^2}{(1-\pi)( (1-\pi)^2 - \delta^2)} > 0, \end{align}

where the final inequality uses $\delta > 0,$ and that $ \pi \in (\delta, 1-\delta).$
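If one wants to double-check that last algebraic simplification, here is a small sympy sketch (a verification aid only, not part of the proof; variable names are mine) comparing the term-by-term second derivative with the closed form whose two summands are manifestly positive for $\delta > 0$ and $\pi \in (\delta, 1-\delta)$:

```python
import sympy as sp

pi, delta = sp.symbols("pi delta", positive=True)

# Second derivative of J, differentiated term by term (first line of the display above).
raw = (sp.Rational(1, 2) * (1 / (pi + delta) + 1 / (pi - delta)
                            + 1 / (1 - pi - delta) + 1 / (1 - pi + delta))
       - 1 / pi - 1 / (1 - pi))

# Final simplified form: both summands are positive when 0 < delta < pi < 1 - delta.
closed = (delta**2 / (pi * (pi**2 - delta**2))
          + delta**2 / ((1 - pi) * ((1 - pi)**2 - delta**2)))

print(sp.simplify(raw - closed))  # expect 0
```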

As a consequence, for every fixed $\delta >0,$ $J$ is strictly convex in $\pi$ on $(\delta, 1-\delta)$ (and continuous on $[\delta, 1-\delta]$). Since the maximum of a convex function over a closed interval is attained at an endpoint, we have $$ J(\pi ,\delta) \le \max( J(\delta, \delta), J(1- \delta, \delta) ).$$

But $$J(\delta, \delta) = h(\delta) - \frac{1}{2} (h(2\delta) + h(0) ) = h(\delta) - \frac{1}{2} h(2\delta),$$ since $h(0) = 0$, and similarly $$J(1-\delta, \delta) = h(1-\delta) - \frac{1}{2} h(1) - \frac{1}{2} h(1-2\delta) = h(\delta) - \frac{1}{2} h(2\delta),$$ by the symmetry of $h$ and $h(1) = 0$. We immediately get that for every $\delta \in [0,1/2]$ and $\pi \in [\delta, 1-\delta],$ $$J(\pi, \delta) \le h(\delta) - \frac{1}{2} h(2\delta) \le h(\delta),$$ finishing the argument.
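As a numerical sanity check of this last display (again illustrative only; the grid is my own choice), sweeping $(\pi, \delta)$ and evaluating $J(\pi,\delta) - \left(h(\delta) - \tfrac12 h(2\delta)\right)$ should never give a positive value:

```python
import math

def h(x):
    """Binary entropy in nats, with h(0) = h(1) = 0."""
    return 0.0 if x <= 0.0 or x >= 1.0 else -x * math.log(x) - (1 - x) * math.log(1 - x)

def J(pi, delta):
    """Two-symbol JSD via the expansion J = h(pi) - (h(pi + delta) + h(pi - delta)) / 2."""
    return h(pi) - 0.5 * h(pi + delta) - 0.5 * h(pi - delta)

worst = -float("inf")
for k in range(51):
    delta = k / 100                               # delta in [0, 1/2]
    for j in range(101):
        pi = delta + j * (1 - 2 * delta) / 100    # pi in [delta, 1 - delta]
        worst = max(worst, J(pi, delta) - (h(delta) - 0.5 * h(2 * delta)))
print("max of J - (h(delta) - h(2*delta)/2):", worst)  # expect <= 0 up to rounding
```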


Note that the chain of inequalities above establishes something stronger for $2$-symbol distributions: $J(\pi, \delta) \le h(\delta) - h(2\delta)/2$. Unfortunately the RHS is a convex (rather than concave) function of $\delta$, so the region below its graph is not convex, and the joint-range argument does not directly extend this stronger bound to all alphabets. It would be interesting to know whether a bound with such an advantage can be shown in general.
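For what it's worth, that convexity can be spot-checked numerically as well (step sizes below are arbitrary choices of mine): the second differences of $\delta \mapsto h(\delta) - h(2\delta)/2$ stay nonnegative on $(0, 1/2)$.

```python
import math

def h(x):
    """Binary entropy in nats, with h(0) = h(1) = 0."""
    return 0.0 if x <= 0.0 or x >= 1.0 else -x * math.log(x) - (1 - x) * math.log(1 - x)

def g(d):
    """The two-symbol bound as a function of delta."""
    return h(d) - 0.5 * h(2 * d)

eps = 1e-4
second_diffs = [g(d + eps) - 2 * g(d) + g(d - eps) for d in (k / 1000 for k in range(1, 500))]
print("smallest second difference:", min(second_diffs))  # expect >= 0 (convexity)
```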