Why is the Kullback-Leibler divergence not symmetric?
As is well known, the Kullback-Leibler divergence
$$\operatorname{KL}(P,Q)=\sum_{i=1}^n \ln\left(\frac{P(i)}{Q(i)}\right)P(i)$$
is not symmetric. I would like to know how this can be seen from the formula. I am aware that I could just try it out by exchanging $Q$ and $P$ in some special case, but I would like to know the mathematical reason behind it. Also, is $i$ actually the random variable here?
In addition to the algebraic reason that Robert Israel gave, there's a very nice "moral reason" that the Kullback-Leibler divergence is not symmetric. Roughly speaking, it's because you should think of the two arguments of the KL divergence as different kinds of things: the first argument is empirical data, and the second argument is a model you're comparing the data to. Here's how it works.
Take a bunch of independent random variables $X_1, \ldots, X_n$ whose possible values lie in a finite set.* Say these variables are identically distributed, with $\operatorname{Pr}(X_i = x) = p_x$. Let $F_{n,x}$ be the fraction of the variables whose values are equal to $x$. The list $F_n$ is a random variable, often called the "empirical frequency distribution" of the $X_i$. What does $F_n$ look like when $n$ is very large?
More specifically, let's try to estimate the probabilities of the possible values of $F_n$. Since the set of possible values is different for different $n$, take a sequence of frequency distributions $f_1, f_2, f_3, \ldots$ approaching a fixed frequency distribution $f$. It turns out** that $$\lim_{n \to \infty} \tfrac{1}{n} \ln \operatorname{Pr}(F_n = f_n) = -\operatorname{KL}(f, p).$$ In other words, the Kullback-Leibler divergence of $f$ from $p$ lets you estimate the probability of getting an empirical frequency distribution close to $f$ from a large number of independent random variables with distribution $p$.
You can find everything I just said, and more, in the excellent article "Information Theory, Relative Entropy and Statistics," by François Bavaud.
* You can also do this more generally, but I don't know anything about that.
** Using Stirling's approximation, $\ln k! = k\ln k - k + O(\ln k)$.
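If it helps, here is a minimal numerical sketch of that limit (the distributions `p` and `f` and the helper names `kl` and `empirical_log_prob` are my own illustrative choices, not from the article): for $n$ i.i.d. draws from $p$, the probability that the empirical frequencies equal $f$ is a multinomial probability, and $\tfrac{1}{n}\ln \operatorname{Pr}(F_n = f)$ does indeed creep toward $-\operatorname{KL}(f, p)$ as $n$ grows.

```python
# A minimal numerical sketch (p, f, and the helper names are illustrative choices).
# For n i.i.d. draws from p, the probability of seeing the empirical frequency
# distribution f (i.e. counts n*f_x) is the multinomial pmf
#   Pr(F_n = f) = n! / prod_x (n f_x)!  *  prod_x p_x^(n f_x),
# and (1/n) ln Pr(F_n = f) should approach -KL(f, p) as n grows.
from math import lgamma, log

p = [0.5, 0.3, 0.2]   # model distribution
f = [0.2, 0.4, 0.4]   # empirical frequency distribution we ask about

def kl(a, b):
    return sum(ai * log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

def empirical_log_prob(n, f, p):
    """ln Pr(F_n = f) when the counts n*f_x are integers."""
    counts = [round(n * fx) for fx in f]
    log_coeff = lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)
    return log_coeff + sum(c * log(px) for c, px in zip(counts, p))

for n in (10, 100, 1000, 10000):
    print(n, empirical_log_prob(n, f, p) / n, -kl(f, p))
# the middle column approaches -KL(f, p); the error decays roughly like log(n)/n
```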
$$KL(P,Q) - KL(Q,P) = \sum_{i=1}^n \ln\left(\frac{P(i)}{Q(i)}\right) (P(i) + Q(i))$$ and there is no reason for this to be $0$.
$i$ is not a random variable, it is a dummy index. However, there can be a random variable that takes value $i$ with probability $P(i)$, and another that takes the value $i$ with probability $Q(i)$.
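To make the asymmetry concrete, here is a small numerical check (the distributions $P$ and $Q$ below are arbitrary choices, just for illustration): the two divergences come out different, and their difference matches the sum above.

```python
# Small numerical check of the asymmetry and of the identity above
# (P and Q are arbitrary illustrative distributions).
from math import log

P = [0.1, 0.6, 0.3]
Q = [0.4, 0.4, 0.2]

def kl(A, B):
    return sum(a * log(a / b) for a, b in zip(A, B))

print(kl(P, Q))   # ~0.226
print(kl(Q, P))   # ~0.311  -> not the same, so KL is not symmetric
print(kl(P, Q) - kl(Q, P))                               # ~-0.085
print(sum(log(p / q) * (p + q) for p, q in zip(P, Q)))   # same ~-0.085
```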
While recently studying KL divergence, I came across an intuitive explanation from Ian Goodfellow, via an example in his NIPS 2016 GAN tutorial (NIPS-2016-GAN-Tutorial). I hope it gives an intuitive sense of why KL divergence is not symmetric!
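The example, as I recall it, contrasts the two directions of the divergence when fitting a single Gaussian $q$ to a bimodal distribution $p$: minimizing $\operatorname{KL}(p, q)$ (maximum likelihood, the "forward" direction) spreads $q$ over both modes, while minimizing $\operatorname{KL}(q, p)$ (the "reverse" direction) concentrates $q$ on one mode. The sketch below is my own rough reconstruction of that effect, not Goodfellow's code; the grid, the target mixture, and the brute-force search are all my choices.

```python
# Forward vs. reverse KL when fitting one Gaussian to a bimodal target,
# discretized on a grid (my own reconstruction of the effect, for illustration).
import numpy as np

x = np.linspace(-6, 6, 601)

def normal(x, mu, sigma):
    """Gaussian density discretized and normalized on the grid."""
    w = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return w / w.sum()

# bimodal target: an even mixture of two well-separated Gaussians
p = 0.5 * normal(x, -2.0, 0.5) + 0.5 * normal(x, 2.0, 0.5)

def kl(a, b):
    b = np.maximum(b, 1e-300)          # guard against log(0) from underflow
    mask = a > 0
    return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

# brute-force search over single-Gaussian fits q = normal(x, mu, sigma)
mus = np.linspace(-3.0, 3.0, 61)
sigmas = np.linspace(0.3, 3.0, 28)
best_fwd = min((kl(p, normal(x, m, s)), m, s) for m in mus for s in sigmas)
best_rev = min((kl(normal(x, m, s), p), m, s) for m in mus for s in sigmas)

print("argmin of KL(p, q):", best_fwd[1:])  # mu near 0, large sigma: covers both modes
print("argmin of KL(q, p):", best_rev[1:])  # mu near +/-2, small sigma: locks onto one mode
```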
You can just look at one term under the sum, forgetting the index $i$. So if you consider:
$$K(p,q) = \log\left(\frac{p}{q}\right)p$$
you can see there is an apparent lack of symmetry: $p$ appears twice, $q$ only once. And this matters: suppose $p$ and $q$ are related by a simple linear relationship, say $q=\alpha p$; then
$$K(p,\alpha p) = -\log\left(\alpha\right)p$$
while the term with the arguments swapped is:
$$K(\alpha p, p) = \log\left(\alpha\right)\alpha p$$
and these are clearly different whenever $p\ne 0$ and $\alpha \ne 1$.
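A quick numeric sanity check of those two expressions (the values of $p$ and $\alpha$ below are arbitrary, chosen just to illustrate):

```python
# Check the single-term computation K(p, q) = log(p/q) * p with q = alpha * p
# (p and alpha are arbitrary illustrative values).
from math import log

def K(p, q):
    return log(p / q) * p

p, alpha = 0.5, 2.0
print(K(p, alpha * p))   # -log(alpha) * p        ~ -0.347
print(K(alpha * p, p))   #  log(alpha) * alpha*p  ~  0.693
```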