What does the $-\log[P(X)]$ term mean in the calculation of entropy?

Solution 1:

Easy illustrative example:

Take a fair coin. $P({\rm each\ result})=1/2$. By independence, $P({\rm each\ sequence\ of\ results\ in\ }n{\rm\ tosses})=1/2^n$. The surprise of each toss is the same, and the surprise of $n$ tosses should be $n$ times the surprise of one toss. The $\log$ does exactly this trick: $-\log(1/2^n)=n\times\bigl(-\log(1/2)\bigr)$, so $-\log[P(X)]$ turns products of probabilities into sums of surprises. And the entropy is the mean surprise.
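A minimal numerical sketch of this additivity in Python (the `surprisal` helper is illustrative notation, not a standard library function):

```python
import math

def surprisal(p, base=2):
    """Surprise (self-information) -log P of an outcome with probability p, in bits."""
    return -math.log(p, base)

one_toss = surprisal(0.5)        # a fair toss: 1 bit of surprise

n = 10
sequence = surprisal(0.5 ** n)   # one particular sequence of n tosses
assert math.isclose(sequence, n * one_toss)  # the log turns products into sums

# Entropy of one fair toss = mean surprise over its two outcomes.
entropy = 0.5 * surprisal(0.5) + 0.5 * surprisal(0.5)
print(one_toss, sequence, entropy)  # 1.0  10.0  1.0
```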

Solution 2:

In his 1948 paper, Claude Shannon introduced the entropy $H$ of a discrete random variable $X$ with probabilities $p_1, \dots, p_n$ as a function satisfying three requirements that any measure of the information contained in $X$ should meet:

  1. $H$ should be continuous in the $p_i$.
  2. If all the $p_i$ are equal, $p_i = \frac{1}{n}$, then $H$ should be a monotonic increasing function of $n$. With equally likely events there is more choice, or uncertainty, when there are more possible events.
  3. If a choice be broken down into two successive choices, the original $H$ should be the weighted sum of the individual values of $H$.

He further illustrates property 3 with a nice example: a choice among probabilities $\tfrac12,\tfrac13,\tfrac16$ can be made by first making a fair binary choice and then, half of the time, a second choice with probabilities $\tfrac23,\tfrac13$, so that $H(\tfrac12,\tfrac13,\tfrac16)=H(\tfrac12,\tfrac12)+\tfrac12\,H(\tfrac23,\tfrac13)$. Then, in Appendix 2, he shows that only a function of the form $$-K \sum_{i=1}^n p_i \log(p_i)$$ can satisfy all three requirements, where $K$ is a positive multiplicative constant.
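A quick numerical check of this weighted-sum decomposition, as a small Python sketch (the helper `H` is our own notation, not anything from Shannon's paper):

```python
import math

def H(*probs, base=2):
    """Shannon entropy of a finite distribution, in bits."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# Shannon's decomposition example: choosing among {1/2, 1/3, 1/6} directly
# versus first a fair binary choice, then (half of the time) a choice
# between {2/3, 1/3}.
direct = H(1/2, 1/3, 1/6)
staged = H(1/2, 1/2) + 1/2 * H(2/3, 1/3)
assert math.isclose(direct, staged)
print(direct, staged)  # both are about 1.459 bits
```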

Solution 3:

Assume that one repeatedly draws values from a finite set $S$ of size $|S|$ according to a distribution $p=(p_x)_{x\in S}$. After one draw, there are $|S|$ possible results, after two draws there are $|S|^2$, and so on, so one can get the impression that after $n$ draws, the resulting distribution is spread out on the Cartesian product $S^n$, whose size is $|S|^n$. And indeed it is, but this view is deceptive because the distribution is extremely unevenly spread out on $S^n$. Actually:

There exists a subset $T_n\subset S^n$, exponentially smaller than $S^n$ whenever $p$ is not uniform, on which nearly all the distribution of the sample of size $n$ is concentrated. And in this "vanishingly small" subset $T_n$, the weight of each element is roughly the same...

In other words, everything happens as if the combined result of the $n$ first draws was chosen uniformly randomly in $T_n$. What connects the dots is that the size of $T_n$ is $\mathrm e^{nH}$ for some deterministic finite number $H$. (Actually, the size of $T_n$ is $\mathrm e^{nH+o(n)}$.) Surely you recognized that $H$ is the entropy of the distribution according to which one is drawing the values from $S$, that is, $$ H=-\sum_{x\in S}p_x\log p_x=-E[\log p_X], $$ where $X$ is any random variable with distribution $p$.

This surprisingly general phenomenon, related to what is called concentration of measure, quantifies $\mathrm e^H$ as the (growth rate of the) effective size of the sample space. As direct consequences, $0\leqslant H\leqslant\log|S|$, with $H=0$ if and only if $p$ is a Dirac measure, and $H=\log|S|$ if and only if $p$ is uniform.
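To see concretely why $\mathrm e^{nH}$ is the right count, here is a small simulation sketch in Python (the three-letter alphabet and its weights are an arbitrary choice of ours): by the law of large numbers, $-\frac1n\log p(X_1,\dots,X_n)=-\frac1n\sum_{i=1}^n\log p_{X_i}\to H$, so the observed sequence has probability roughly $\mathrm e^{-nH}$, which is exactly what membership in the typical set $T_n$ means.

```python
import math
import random

S = ['a', 'b', 'c']
p = {'a': 0.7, 'b': 0.2, 'c': 0.1}   # any non-uniform distribution works

H = -sum(q * math.log(q) for q in p.values())  # entropy in nats

random.seed(0)
n = 100_000
sample = random.choices(S, weights=[p[x] for x in S], k=n)

# Per-symbol log-probability of the observed sequence of draws.
log_prob = sum(math.log(p[x]) for x in sample)
print(-log_prob / n, H)  # both are about 0.80 nats: the sample lands in T_n
```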