Is log the only choice for measuring information?

When we quantify information, we use $I(x)=-\log{P(x)}$, where $P(x)$ is the probability of some event $x$. The explanation I always got, and was satisfied with up until now, is that for two independent events we multiply their probabilities to get the probability of both, and we intuitively want the information of the two events to add up to the total information. So we have $I(x \cdot y) = I(x) + I(y)$. The class of logarithms $k \log(x)$ for some constant $k$ satisfies this identity, and we choose $k$ negative (say $k=-1$) so that information is nonnegative, since $P(x)\le 1$.
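For a concrete sanity check of that additivity (my own toy numbers, with base-$2$ logs): two independent fair coin flips give
$$I\left(\tfrac12\cdot\tfrac12\right)=-\log_2\tfrac14=2=1+1=I\left(\tfrac12\right)+I\left(\tfrac12\right),$$
i.e. two flips carry two bits, one per flip.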

But I'm wondering if logarithms are more than just a sensible choice. Are they the only choice? I can't immediately think of another class of functions that satisfies that basic identity. Even in Shannon's original paper on information theory, he doesn't say it's the only choice; he justifies his choice by saying that logs fit what we expect and are easy to work with. Is there more to it?


We want to classify all continuous(!) functions $I\colon(0,1]\to\Bbb R$ with $I(xy)=I(x)+I(y)$. If $I$ is such a function, we can define the (also continuous) function $f\colon[0,\infty)\to \Bbb R$ given by $f(x)=I(e^{-x})$ (using that $x\ge 0$ implies $e^{-x}\in(0,1]$). Then $f$ satisfies the functional equation $$f(x+y)=I(e^{-(x+y)})=I(e^{-x}e^{-y})=I(e^{-x})+I(e^{-y})=f(x)+f(y).$$ Let $$ S:=\{\,a\in[0,\infty)\mid \forall x\in[0,\infty)\colon f(ax)=af(x)\,\}.$$ Then trivially $1\in S$. Also, $f(0+0)=f(0)+f(0)$ implies $f(0)=0$ and so $0\in S$. By the functional equation, $S$ is closed under addition: if $a,a'\in S$, then for all $x\ge 0$ we have $$f((a+a')x)=f(ax+a'x)=f(ax)+f(a'x)=af(x)+a'f(x)=(a+a')f(x)$$ and so also $a+a'\in S$.

Using this we show by induction that $\Bbb N\subseteq S$: we have $1\in S$; and if $n\in S$, then also $n+1\in S$ (because $1\in S$ and $S$ is closed under addition).

Next note that if $a,b\in S$ with $b>0$, then for all $x\ge0$ we have $f(a\frac xb)=af(\frac xb)$ and $f(x)=f(b\frac xb)=bf(\frac xb)$, so $f(\frac xb)=\frac1b f(x)$. Combining the two gives $f(\frac ab x)=\frac ab f(x)$, and so $\frac ab\in S$. As $\Bbb N\subseteq S$, this implies that $S$ contains all positive rationals, $\Bbb Q_{>0}\subseteq S$.

In particular, if we let $c:=f(1)$, then $f(x)=cx$ for all $x\in \Bbb Q_{>0}$. As we wanted continuous functions and $\Bbb Q_{>0}$ is dense in $[0,\infty)$, it follows that $f(x)=cx$ for all $x\in[0,\infty)$. Then $$ I(x)=f(-\ln x)=-c\ln x.$$
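Not part of the proof, but here is a minimal numerical sanity check (my own sketch, with an arbitrarily chosen constant) that the solutions found above satisfy both functional equations:

```python
import math
import random

# Sanity check (not a proof): I(x) = -c*ln(x) satisfies I(x*y) = I(x) + I(y),
# and f(x) = I(exp(-x)) = c*x satisfies f(x+y) = f(x) + f(y).
c = 1.7  # arbitrary constant, playing the role of f(1)

def I(x):
    return -c * math.log(x)

def f(x):
    return I(math.exp(-x))

random.seed(0)
for _ in range(5):
    x, y = random.random(), random.random()           # points in (0, 1)
    assert math.isclose(I(x * y), I(x) + I(y))         # multiplicativity -> additivity
    a, b = 10 * random.random(), 10 * random.random()  # points in [0, 10)
    assert math.isclose(f(a + b), f(a) + f(b))         # Cauchy's functional equation
print("additivity holds numerically for c =", c)
```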

Remark: The requirement that $I$ (and hence $f$) be continuous is of course reasonable in the given context. But it turns out that much milder restrictions on $f$ (for instance measurability, or boundedness on some interval) suffice to enforce the result found above. It is only without any such restriction that the Axiom of Choice supplies us with highly non-continuous additional solutions to the functional equation. The original remark that logs fit what we expect and are easy to work with is quite an understatement if one even thinks of considering these non-continuous solutions.


I just want to point something out; honestly, I think the other answers are far better, given that this is a mathematics site. I'm adding this only as another argument for why the logarithm makes sense as the only choice.

You have to ask yourself: what even is information?

Information is the ability to distinguish possibilities.¹

¹ Compare with energy in physics: the ability to do work or produce heat.

Okay, let's start reasoning.

Every bit (= binary digit) of information can (by definition) distinguish $2$ possibilities, because it can have $2$ different values. Similarly, every $n$ bits of information can distinguish $2^n$ possibilities.

Therefore: the amount of information required to distinguish $2^n$ possibilities is $n$ bits.
And the same exact reasoning works regardless of whether you're talking about base 2 or 3 or e.
So clearly you have to take a logarithm if the number of possibilities is an integer power of the base.
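As a quick illustration (my own toy snippet), $n$ bits really do yield $2^n$ distinguishable strings, and the base-$2$ logarithm recovers $n$ from that count:

```python
import math
from itertools import product

# n bits give exactly 2**n distinct strings; log2 recovers n from that count.
n = 3
strings = ["".join(bits) for bits in product("01", repeat=n)]
print(len(strings))             # 8 == 2**3 possibilities
print(math.log2(len(strings)))  # 3.0 -> n bits of information
```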

Now, what if the number of possibilities is not a power of $b = 2$ (or whatever your base is)?
In this case you're looking for a function that coincides with the logarithm at the integer powers.
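For instance (my own numbers), with $b = 2$ and $5$ possibilities, $2$ bits are too few and $3$ bits are more than enough; whatever the measure is, it has to land between the values the logarithm takes at the neighbouring powers of the base: $$2=\log_2 4<\log_2 5\approx 2.32<\log_2 8=3.$$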

At this point I would already be convinced to use the logarithm itself (anything else would seem bizarre), but this is where a mathematician would invoke the arguments mentioned in the other answers (continuity, additivity for independent events, or whatever) to show that no other function could satisfy reasonable criteria on information content.


My understanding is that $-\log$ provides a mapping $({\mathbb R}_{\geq 0},+,\cdot)\rightarrow({\mathbb R}\cup\{\infty\},\min,+)$ between semirings (multiplicatively, a monoid homomorphism). It is monotonically decreasing and maps large probabilities to low weights and vice versa. This is used in certain statistical models such as sequence alignment and hidden Markov models. The mapping is sometimes referred to as tropicalization. Have a look at the work of Bernd Sturmfels et al.
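A toy illustration of why this view is convenient (my own sketch): after applying $-\log$, products of probabilities become sums of weights, and choosing the most probable of several alternatives becomes choosing the minimum total weight, as in Viterbi decoding of a hidden Markov model.

```python
import math

# Toy "tropicalization": -log turns products of probabilities into sums of
# weights, and "most probable" into "minimum total weight".
paths = {
    "path A": [0.5, 0.9, 0.8],
    "path B": [0.6, 0.6, 0.7],
}

best_by_prob = max(paths, key=lambda k: math.prod(paths[k]))
best_by_weight = min(paths, key=lambda k: sum(-math.log(p) for p in paths[k]))
assert best_by_prob == best_by_weight  # -log is monotone, so the two criteria agree
print(best_by_prob)
```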


$\log$ (or $\ln$) is definitely not the only way of measuring information - it depends on what we understand by information. But the way we have chosen to define information (see below) confines us to using $\log$ or $\ln$.

I've tried to explain it here on stats.stackexchange.com. I'm pasting it below for quick reference.

There is a profound reason why the logarithm comes into the picture, and it is not an arbitrary choice. The relationship between $\log$ and information stems from this simple way of writing any number $m$ (the symbols don't have any meaning yet), and the discussion that follows.

$$ m = \frac{1}{p} = 2^{i} \tag{1}$$

The above tells us that if we use exactly $i$ letters to encode a string, where each letter can take one of two values, we get $m$ different strings. A two-valued letter is nothing but a bit. So writing any number $m$ in this way brings into the picture a property of the number, $i$, from which $m$ can be reconstructed uniquely using bits.
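As a concrete instance of $(1)$ (my own toy numbers): with $p = 1/8$ we get $m = 1/p = 8 = 2^3$, so $i = 3$, and the $m$ distinct strings are just the 3-bit codewords.

```python
# Concrete instance of (1): p = 1/8 gives m = 1/p = 8 = 2**3, so i = 3 bits.
p = 1 / 8
m = int(1 / p)
i = m.bit_length() - 1                       # exponent with 2**i == m (here 3)
codewords = [format(k, f"0{i}b") for k in range(m)]
print(i, codewords)                          # 3 ['000', '001', ..., '111']
```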

Now, it is easy to see that for a given outcome that has probability $p$, the number of other outcomes in the same event with probability greater than $p$ is always at most $\frac{1}{p}$, simply because the probabilities must sum to $1$. For detail on this, check here.
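For example (my own numbers): if $p = 0.2$, there cannot be more than $\frac{1}{0.2} = 5$ outcomes with probability at least $0.2$, since six such outcomes would already have total probability greater than $1$.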

This means that, as per $(1)$, $i=\log_2(\frac{1}{p})$ bits can safely be used to represent this outcome in an event, unless there are lower-probability outcomes. But even if there are lower-probability outcomes, it is easy to see that we can still encode this outcome with $i=\log_2(\frac{1}{p})$ bits, and use more bits to encode the lower-$p$ outcomes. Check here for a detailed proof. In summary, $i=\log_2(\frac{1}{p})$ bits can safely be used to represent this outcome in any event.

Now, the information about an outcome that goes from the sender to the receiver is really the codeword that represents the outcome, and we just saw how the length of that codeword is determined by $\log_2(\frac{1}{p})$. So we choose to call this special length $i$ the information of the outcome, and that is how $\log$ comes into the picture naturally.

$ p=0.25 \Rightarrow i = 2 $ means that we need $ 2 $ bits for encoding this outcome in any event.

$ p=0.125 \Rightarrow i = 3 $ means that we need $3$ bits for encoding this outcome in any event.
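A quick check of the two examples above, plus one more (my own throwaway snippet):

```python
import math

# Codeword length i = log2(1/p) for a few dyadic probabilities.
for p in (0.5, 0.25, 0.125):
    i = math.log2(1 / p)
    print(f"p = {p}: i = {i:.0f} bit(s)")
```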

Finally, what would be the information content of any event in total, that is, the information of all the outcomes combined? In other words, what is the information content of a system that can be in different states with different probabilities? The answer is that each outcome or state contributes its information to the system, but only in proportion to how much of it is there, i.e. its probability. This is just a verbal description of the entropy equation: $$\begin{align} H &= \sum_k p_k \, i_k \\[6pt] &= - \sum_k p_k \log_2(p_k) \end{align}$$ where $i_k = \log_2(\frac{1}{p_k})$ is the information of outcome $k$.
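A small numerical illustration (my own toy distribution):

```python
import math

# Entropy of a toy distribution: each outcome contributes log2(1/p_k) bits,
# weighted by its probability p_k.
probs = [0.5, 0.25, 0.125, 0.125]
H = sum(p * math.log2(1 / p) for p in probs)
print(H)  # 1.75 bits
```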

The above has been explained in more detail here.