Intuitive explanation of entropy

I have bumped into entropy many times, but it has never been clear to me why we use this formula:

If $X$ is a random variable, then its entropy is:

$$H(X) = -\displaystyle\sum_{x} p(x)\log p(x).$$

Why are we using this formula? Where did this formula come from? I'm looking for the intuition. Is it because this function just happens to have some good analytical and practical properties? Is it just because it works? Where did Shannon get this from? Did he sit under a tree and entropy fell on his head the way the apple did for Newton? How do you interpret this quantity in the real physical world?


Solution 1:

Here's one mildly informal answer.

How surprising is an event? Informally, the lower the probability you would've assigned to an event, the more surprising it is, so surprise seems to be some kind of decreasing function of probability. It's reasonable to ask that it be continuous in the probability. And if event $A$ has a certain amount of surprise, and event $B$ has a certain amount of surprise, and they're independent, and you observe them together, it's reasonable that the amounts of surprise add.

From here it follows that the surprise you feel at event $A$ happening must be a positive constant multiple of $- \log \mathbb{P}(A)$ (exercise; this is related to the Cauchy functional equation). Taking surprise to just be $- \log \mathbb{P}(A)$, it follows that the entropy of a random variable is its expected surprise, or in other words it measures how surprised you expect to be on average after sampling it.

Closely related is Shannon's source coding theorem, if you think of $- \log \mathbb{P}(A)$ as a measure of how many bits you need to tell someone that $A$ happened.
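
To make "expected surprise" concrete, here is a minimal Python sketch (the distribution is just a made-up example, and I take logs base 2 so surprise is measured in bits):

```python
import math

def surprise(p):
    """Surprise (in bits) of an event with probability p: -log2(p)."""
    return -math.log2(p)

def entropy(probs):
    """Entropy = expected surprise = sum_x p(x) * (-log2 p(x))."""
    return sum(p * surprise(p) for p in probs if p > 0)

# A made-up distribution over four outcomes.
probs = [0.5, 0.25, 0.125, 0.125]
for p in probs:
    print(f"P = {p:5.3f}  ->  surprise = {surprise(p):.3f} bits")
print("entropy (expected surprise):", entropy(probs), "bits")  # 1.75
```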

Solution 2:

We want to define a measure of the amount of information a discrete random variable produces. Our basic setup consists of an information source and a recipient. We can think of our recipient as being in some state. When the information source sends a message, the arrival of the message causes the recipient to go to a different state. This "change" is exactly what we want to measure.

Suppose we have a set of $n$ events with respectively the following probabilities

$$p_1,p_2,...,p_n.$$

We want a measure of how much "choice" is involved in the selection of an event, or of how uncertain we are about the outcome.

Intuitively, it should satisfy the following four conditions.

Let $H$ be our "measure".

  1. $H$ is continuous in the $p_i$.

  2. If some $p_i = 1$ (and all the others are $0$), then $H$ is at its minimum value of $0$: there is no uncertainty.

  3. If $p_1 = p_2= \dots = p_n$, i.e. $p_i=\frac{1}{n}$, then $H$ is maximum. In other words, when every outcome is equally likely, the uncertainty is greatest, and hence so is the entropy.

  4. If a choice is broken down into two successive choices, the value of the original $H$ should be the weighted sum of the value of the two new ones.

    An example of this condition $4$ is that $$H\left(\frac1{2}, \frac1{3}, \frac{1}{6} \right) = H\left(\frac{1}{2}, \frac{1}{2} \right) + \frac{1}{2} H\left(1 \right) + \frac{1}{2} H\left(\frac{2}{3}, \frac{1}{3} \right)$$

Here we decide either to (a) take the first element or (b) take one of the other two elements. In case (a) there is no further decision (so that branch contributes $\frac{1}{2}H(1) = 0$), but in case (b) we still have to decide which of the two remaining elements to take.

The only $H$ satisfying the conditions above is:

$$H = -K\sum^n_{i=1}p_i \log(p_i),$$

where $K$ is a positive constant (amounting to a choice of units).
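
As a quick numerical sanity check (Python, taking $K = 1$ and base-2 logarithms), this formula does satisfy the decomposition example given for condition 4 above:

```python
import math

def H(*probs):
    """Entropy with K = 1 and base-2 logs; terms with p = 0 contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

lhs = H(1/2, 1/3, 1/6)
rhs = H(1/2, 1/2) + (1/2) * H(1) + (1/2) * H(2/3, 1/3)
print(lhs, rhs)                # both ~1.459
print(math.isclose(lhs, rhs))  # True
```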

To see that this definition gives what we intuitively would expect from a "measure" of information, we state the following properties of $H$.

  1. $H = 0 \iff p_i = 1$ for some $i$ and $p_j = 0$ for all $j \neq i$
  2. $\forall n \in \mathbb{N}$, $H$ is maximum when $p_1 = \cdots = p_n$
  3. Suppose $x$ and $y$ are two events with $n$ and $m$ possible outcomes respectively, and $p(i,j)$ is the probability that $x$ and $y$ jointly occur (i.e. occur at the same time).

    • $H(x, y) = −\sum_{i, j} p(i, j) \log(p(i, j))$

    • $H(x, y) \leq H(x) + H(y)$.

      With equality if and only if the occurrences are independent.

    • $H_x(y) = -\sum_{i, j} p(i, j) \log(p_i(j)) = H(x, y) - H(x),$ where $p_i(j)$ is the conditional probability of $j$ given $i$.

      The entropy of $y$ when $x$ is known.

    • $H(y) \geq H_x(y)$.

      The entropy of $y$ is never increased by knowing $x$.

  4. Any change towards equalization of the probabilities increases $H$. Greater uncertainty $\Rightarrow$ greater entropy.

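To make these properties concrete, here is a small Python sketch with a made-up joint distribution $p(i, j)$ (the numbers are arbitrary, chosen only so that $x$ and $y$ are not independent):

```python
import math

def H(probs):
    """Entropy (base 2) of a list of probabilities; zero-probability terms are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A made-up joint distribution p(i, j) over 2 values of x and 2 values of y.
p = {(0, 0): 0.4, (0, 1): 0.1,
     (1, 0): 0.2, (1, 1): 0.3}

px = [sum(v for (i, j), v in p.items() if i == a) for a in (0, 1)]  # marginal of x
py = [sum(v for (i, j), v in p.items() if j == b) for b in (0, 1)]  # marginal of y

H_xy = H(list(p.values()))   # joint entropy H(x, y)
H_x, H_y = H(px), H(py)
H_y_given_x = H_xy - H_x     # conditional entropy H_x(y)

print("H(x,y)      =", H_xy)
print("H(x) + H(y) =", H_x + H_y)     # >= H(x,y); equality only under independence
print("H_x(y)      =", H_y_given_x)
print("H(y)        =", H_y)           # >= H_x(y): knowing x never increases uncertainty about y
```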

Solution 3:

Let me give you an intuitive (rather than mathematically rigorous) interpretation of entropy, denoted $H(X)$.

Let me start by giving you my interpretation first and then let me justify it.

Entropy can be viewed as the cost of encoding a specific distribution $X$.

Since I will describe it in terms of encoding messages, let me change the notation to make the description more intuitive. We want to transmit some message $(M=m)$ across some channel $C$. Intuitively, the cost of sending a message across a channel is the length of its encoding: the longer the message, the more it costs us to send, since we have to send more bits of information. The frequency (and hence the probability) of each message is dictated by the language $\mathcal{L}$ the message comes from. For example, the language could be $\mathcal{L} = \text{English}$, where the word "the" is relatively common (i.e. high frequency and high probability), so we should choose its encoding wisely, since we will have to send it very often (or, in the case of English, write it pretty often!). We want an efficient encoding for "the", where by efficient we mean an encoding that requires fewer symbols (bits of information) to be sent through the channel. Since the messages we have to send are somewhat random, it seems reasonable to aim to send the least number of bits that we can, at least on average. i.e. intuitively, we want to minimize:

$$ E[|M|] = \sum_m \Pr[M=m]\,|m|$$

where $|m|$ denotes the length of the encoding of message $m$.

For example, we might want to encode things this way: for common (high-probability) messages, let's use fewer bits to encode them, since we have to send them very frequently. So we can encode them based on the relative frequencies dictated by the distribution of $\mathcal{L}$. With a little more thought you can come up with Huffman coding or some similar scheme, provided you make sure that the messages can be decoded unambiguously; the main idea, in my opinion, is to encode frequent words with short code lengths and infrequent ones with longer code lengths.

It turns out that Shannon proved that the notion of entropy provides a precise lower bound for the expected number of bits required to encode instances/messages sampled from $P(M)$. That is, if we consider any proper (uniquely decodable) codebook for values of $M \in \mathcal{L}$, then the expected code length, relative to the distribution $P(M)$, cannot be less than the entropy $H(M)$:

$$H(M) \leq E[|M|]$$

Since there exists a scheme that makes this inequality (asymptotically) tight, we can expect to encode the messages $M$ as efficiently as possible (on average).

Thus, returning to the interpretation I suggested: since the cost of encoding something can be thought of as the number of bits we need to send through a channel, and the optimal value (the entropy) can be achieved, entropy becomes the expected cost of encoding a distribution of messages.

(Or, if you want to view it from the inequality's perspective, it's the best/minimum expected cost you can have to encode any known distribution $P(M)$.)
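
Here is a rough Python sketch of this idea (the message "language" and its probabilities are made up, and Huffman coding is just one concrete scheme that comes close to the bound): it builds a prefix-free codebook and checks that the expected code length satisfies $H(M) \leq E[|M|] < H(M) + 1$.

```python
import heapq
import math

def huffman_code(probs):
    """Build a binary Huffman code for {symbol: probability}; returns {symbol: codeword}."""
    # Each heap entry: (probability, tie-breaker, {symbol: partial codeword}).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, count, merged))
        count += 1
    return heap[0][2]

# A made-up distribution over messages.
probs = {"the": 0.5, "cat": 0.25, "sat": 0.15, "zyzzyva": 0.10}

code = huffman_code(probs)
expected_len = sum(probs[s] * len(code[s]) for s in probs)
entropy = -sum(p * math.log2(p) for p in probs.values())

print(code)
print("entropy H(M)        :", entropy)       # the lower bound
print("expected code length:", expected_len)  # H(M) <= E[|M|] < H(M) + 1
```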

Solution 4:

Here is a simple intuitive explanation of Shannon entropy.

The telegraph message "SOS" is encoded as "...---..." in Morse code. The thing to note is that the message is made up of letters from the alphabet, but what is transmitted down the communication line are only dots and dashes. The message is written in the alphabet but transmitted in dots and dashes. Morse code maps letters to dots and dashes.

Text messages, emails and instant messaging are all written in text (i.e. upper and lower case letters, space, tab, decimal digits, punctuation marks, etc.) but transmitted as bits $\{0, 1\}$. For the mathematician, the problem of communication is that of finding the most efficient way of mapping text messages to streams of bits. By most efficient I mean the least number of bits. If I have a text message of 100 characters, what is the smallest number of bits I need to transmit down the line?

From these examples we can see that message transmission involves two sets, $A$ and $B$. The message is a sequence of letters from set $A$, but the communication line can only transmit characters from set $B$. Let $m_A$ be a message written in $A$ and let $m_B$ be the same message written in $B$. Let $E$ be an encoding function that maps messages written in $A$ to messages written in $B$.

$$m_B = E(m_A)$$

We can measure the size of the message by counting the number of characters in the message. The length of $m_A$ is $L(m_A)$ and the length of $m_B$ is $L(m_B)$.

Clearly a short message will have a short encoding and a long message will have a long encoding. If we double the length of the message, we will double the length of the encoding. The length of the encoding will be proportional to the length of the message.

$$L(m_B) \varpropto L(m_A)$$

By introducing a constant of proportionality k we can turn this into an equation.

$$L(m_B) = kL(m_A)$$

The problem of finding the most efficient encoding reduces to finding the minimum possible value of $k$. This minimum value is the entropy of the set $A$ measured over the set $B$.

If

$$A = \{A_1, A_2, A_3, ..., A_n\}$$

and the probability of $A_i$ appearing in the message is $p_i$, and $B$ is the set

$$B = \{B_1, B_2, B_3, ..., B_m\}$$

then the entropy is given by

$$Entropy = \sum_{i=1}^n p_i\log_m(\frac{1}{p_i})$$

Note that the log is taken to base $m$, which is the size of the set $B$.
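
For instance (a made-up example, using Python's two-argument `log` for the base-$m$ logarithm): with a channel alphabet of size $m = 3$ and letter probabilities $0.5, 0.25, 0.25$, the formula gives roughly $0.95$ channel characters per letter of $A$.

```python
import math

# Made-up example: channel alphabet B of size m = 3,
# message alphabet A with letter probabilities p_i.
m = 3
p = [0.5, 0.25, 0.25]

entropy = sum(p_i * math.log(1 / p_i, m) for p_i in p)
print(entropy)  # ~0.946 base-3 symbols per letter of A
```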

So far I have dealt with the most general case, but now I will switch to the simple case where each of the $n$ characters in $A$ is equally likely, so $p_i = \frac{1}{n} \ \forall i$.

$$Entropy = \sum_{i=1}^n \frac{1}{n} \log_m(n)$$

This simplifies to

$$Entropy = \log_m(n)$$

In this case the entropy depends only on the sizes of $A$ and $B$.

To prove that this is the correct function for the entropy, we consider an encoding $E: A^r \rightarrow B^s$ that encodes blocks of $r$ letters in $A$ as blocks of $s$ characters in $B$.

$$L(m_A) = r, \space L(m_B) = s, \space m_B = E(m_A)$$

The size of the range of $E$ must be greater than or equal to the size of the domain, otherwise two different messages in the domain would have to map to the same encoding in the range. The size of the domain is $n^r$ and the size of the range is $m^s$. We choose $s$ to satisfy the following inequalities.

$$m^{(s-1)} \lt n^r \le m^s$$

The right hand inequality ensures the range is at least as large as the domain. The left hand inequality ensures that $s$ is the smallest value with this property.

Taking the log to base $m$ of both sides of these inequalities gives us:

$$\log_m(m^{(s-1)}) \lt \log_m(n^r) \le \log_m(m^s)$$

$$(s-1)\log_m(m) \lt r\log_m(n) \le s\log_m(m)$$

$$\frac{(s-1)}{r} \lt \log_m(n) \le \frac{s}{r}$$

The constant of proportionality k we introduced earlier is the ratio $\frac{s}{r}$.

$$k = \frac{s}{r}$$

So the right hand inequality proves

$$ \log_m(n) \le k $$

This proves that $\log_m(n)$ is a lower bound for $k$, but how close can an encoding come to this lower bound? We note that

$$\frac{(s-1)}{r} \lt \log_m(n) \le \frac{s}{r}$$

implies

$$\frac{s}{r}-\log_m(n) \lt \frac{s}{r}-\frac{(s-1)}{r}$$

$$\frac{s}{r}-\log_m(n) \lt \frac{1}{r}$$

$$k-\log_m(n) \lt \frac{1}{r}$$

We can make k as close as we like to $\log_m(n)$ by increasing the block size r.
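
As a numerical illustration (Python; the alphabet sizes are my own arbitrary choice): take $n = 26$ equally likely letters encoded into $m = 2$ channel symbols (bits), and watch $k = s/r$ approach $\log_2(26) \approx 4.70$ as the block size $r$ grows.

```python
import math

n = 26  # size of the message alphabet A (equally likely letters)
m = 2   # size of the channel alphabet B (bits)

target = math.log(n, m)           # log_m(n), the claimed minimum value of k
for r in (1, 2, 5, 10, 50, 100):
    s = math.ceil(r * target)     # smallest s with m**s >= n**r
    k = s / r                     # characters of B per letter of A
    print(f"r = {r:3d}   s = {s:4d}   k = {k:.4f}   (lower bound {target:.4f})")
```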

This finishes our treatment of the special case of equal probabilities. Having proved that in this case the entropy is $\log_m(n)$, hopefully the more general formula does not come as too much of a surprise.

I hope you find this simple explanation more intuitive than the usual approach. I searched the internet for an explanation of entropy but didn't like any of the results so I came up with my own. It took me 5 years but now I understand Shannon entropy.

Solution 5:

The postulates in Solution 2 above are essentially the three used in Shannon's original 1948 paper, "A Mathematical Theory of Communication". If you skip over to Appendix 2 in that paper, you can find the remainder of the derivation. In outline:

  1. Derive the expression for $H \left(\tfrac{1}{n}, \tfrac{1}{n}, \ldots, \tfrac{1}{n} \right)$ as $K \log n$ (for a positive constant $K$).

  2. If all the $p_i$'s are rational, we can find an $m$ such that $m p_i \in \mathbb{N}, \forall i$. Now use postulate $3$ (the decomposition rule) to derive the entropy formula; a sketch of this step is given after the list.

  3. Using the continuity postulate (first postulate), you can directly extend the formula to the case where the $p_i$'s are not necessarily rational.
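
For concreteness, here is the sketch of step 2 promised above (it is essentially Shannon's Appendix 2 argument, not anything new): write $p_i = n_i/m$ with the $n_i$ positive integers summing to $m$. A choice among $m$ equally likely outcomes can be decomposed into a choice among the $n$ groups (with probabilities $p_i$), followed by a uniform choice within the chosen group, so by the decomposition postulate and step 1,

$$K \log m = H(p_1, \ldots, p_n) + \sum_{i=1}^n p_i \, K \log n_i,$$

and therefore

$$H(p_1, \ldots, p_n) = K\sum_{i=1}^n p_i \left( \log m - \log n_i \right) = -K\sum_{i=1}^n p_i \log \frac{n_i}{m} = -K\sum_{i=1}^n p_i \log p_i.$$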