What is the relationship between the Boltzmann distribution and information theory?
I'm reading a paper on Boltzmann machines (a type of neural network in Machine Learning), and it mentions that "The Boltzmann distribution has some beautiful mathematical properties and it is intimately related to information theory."
What's the nature of this relationship (between the Boltzmann distribution and information theory), and what are these "beautiful mathematical properties"?
High-level answers are fine (perhaps even preferred, as I don't know much about statistical physics, though I do have some background in information theory; but if it's hard to give a high-level overview, technical explanations are also great).
Solution 1:
In both information theory and physics there is a fundamental quantity called entropy associated with every probability measure. For the sake of simplicity, let me assume that the number of possible events is finite, so that I don't have to get into the technicalities of measure theory. In this case you can describe the probability measure by a function that assigns a probability $p_n$ to each outcome $n$ (e.g. for the possible outcomes of rolling a die you'd have $p_n = 1/6$ for $1 \leq n \leq 6$).
To such a measure one assigns the entropy $$S = - \sum_n p_n \log p_n$$ To get an intuitive feel for this quantity, consider some simple measures. In the case of a deterministic process ($p_1 = 1$ and $p_n = 0$ otherwise) we get $S = - 1 \log 1 = 0$. For the die we get $S = - 6 \cdot {1 \over 6} \log {1 \over 6} = \log 6$. Try playing with some more distributions, but the bottom line is that entropy measures the uncertainty in the distribution: it is zero when we are certain of the outcome and maximal when we have no clue.
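If it helps to see this numerically, here is a minimal Python sketch (the helper name `entropy` is mine, not anything from the paper) that reproduces the two examples above:

```python
import numpy as np

def entropy(p):
    """Shannon/Gibbs entropy S = sum_n p_n log(1/p_n), treating 0 log(1/0) as 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                          # drop zero-probability outcomes
    return float(np.sum(p * np.log(1.0 / p)))

# Deterministic process: one outcome is certain, entropy is zero
print(entropy([1, 0, 0, 0, 0, 0]))        # 0.0

# Fair die: all six outcomes equally likely, entropy is log 6
print(entropy([1/6] * 6), np.log(6))      # 1.791759..., 1.791759...
```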
The reason this is important in physics is the second law of thermodynamics, which states that the entropy of any closed system cannot decrease. It's one of the most fundamental laws of our world, and basically it says that unbroken eggs tend to break (while broken eggs do not tend to fix themselves) and that the mess in your room will not diminish on its own.
Another way to state the same thing is that systems left on their own evolve towards some equilibrium, which is a state of maximal entropy, and then stay there. What this means is that we obtain a variational principle: the measure describing a physical system in equilibrium is the one that maximizes the entropy. You can check for yourself that if there are no constraints on the system, then the maximizer over $N$ outcomes is the uniform distribution $p_n = 1/N$ (e.g. the distribution $p_n = 1/6$ over die rolls has maximal entropy; this is precisely because we have a priori no clue which result we'll obtain).
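As a quick numerical sanity check (not a proof), one can sample many random distributions over $N = 6$ outcomes and confirm that none of them beats the uniform one; this sketch reuses the `entropy` helper defined above:

```python
import numpy as np

N = 6
uniform = np.full(N, 1 / N)
rng = np.random.default_rng(0)
random_dists = rng.dirichlet(np.ones(N), size=10_000)   # random points on the simplex

print(entropy(uniform), np.log(N))                # both approximately log 6 = 1.7918...
print(max(entropy(p) for p in random_dists))      # strictly smaller than log 6
```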
On the other hand, there are constraints in physical systems. The fundamental one is energy: for a closed system the energy is conserved. Now, in the real world, the systems we deal with are rarely closed. Usually they are in contact with surroundings that we are not interested in. So we have a new problem set upon us: determine the probability distribution of a system in equilibrium with its environment, given that its average energy $\left< E \right>$ is fixed (the total energy of system plus environment being conserved). It turns out (surprise!) that this is precisely the Boltzmann distribution $p_n = {1 \over Z} \exp\left(-{E_n \over k_B T}\right)$, where the temperature $T$ arises from the energy constraint in the variational problem and the normalization is given by the partition function $Z(T) = \sum_n \exp\left(-{E_n \over k_B T}\right)$. The last constant $k_B$ is there just to make the units right, so don't worry about it.
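Here is a small sketch of that distribution (in units where $k_B = 1$, with made-up energy levels; the function name `boltzmann` is just for illustration), showing how it interpolates between the unconstrained maximum-entropy (near-uniform) case at high temperature and near-certainty about the lowest-energy state at low temperature:

```python
import numpy as np

def boltzmann(energies, beta):
    """Boltzmann distribution p_n = exp(-beta * E_n) / Z, with beta = 1 / (k_B T)."""
    weights = np.exp(-beta * np.asarray(energies, dtype=float))
    return weights / weights.sum()

energies = np.array([0.0, 1.0, 2.0, 3.0])   # hypothetical energy levels

# High temperature (small beta): nearly uniform, like the unconstrained case
print(boltzmann(energies, beta=0.01))       # roughly [0.254 0.251 0.249 0.246]

# Low temperature (large beta): almost all mass on the ground state
print(boltzmann(energies, beta=10.0))       # roughly [1.0, 4.5e-05, 2.1e-09, 9.4e-14]
```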
As for the beautiful properties of the Boltzmann distribution, it's hard to know where to start. For one thing, one can obtain lots of information from the partition function $Z$. But first let's agree to use the inverse temperature $\beta \equiv {1 \over k_B T}$, which is a more natural variable. Then we have e.g. $$ -{\partial \ln Z \over \partial \beta} = \sum_n E_n p_n = \left< E \right>$$ which is precisely the average energy of the system. We can similarly compute other physical quantities.
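As a sanity check of this identity, a finite-difference derivative of $\ln Z$ (again in units with $k_B = 1$ and the same made-up energy levels as above) reproduces the average energy computed directly from the distribution:

```python
import numpy as np

energies = np.array([0.0, 1.0, 2.0, 3.0])   # same hypothetical levels as above
beta = 0.7

def log_Z(b):
    """ln Z(beta) = ln sum_n exp(-beta * E_n)."""
    return np.log(np.sum(np.exp(-b * energies)))

# Average energy computed directly from the Boltzmann distribution
p = np.exp(-beta * energies - log_Z(beta))
avg_E_direct = np.sum(energies * p)

# ... and via the identity <E> = -d(ln Z)/d(beta), using a central difference
h = 1e-6
avg_E_from_Z = -(log_Z(beta + h) - log_Z(beta - h)) / (2 * h)

print(avg_E_direct, avg_E_from_Z)           # the two agree up to finite-difference error
```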
There's much more to be said, especially in connection with phase transitions (which occur when the solution to the variational problem is not unique) but I hope this will satisfy you for now.
Solution 2:
I think if you want a nice introduction to the link between Boltzmann entropy and information theory, the works of E. T. Jaynes are a good start. The key to understanding the link is his Principle of Maximum Entropy.
Here's a classic paper by Jaynes about information theory and statistical mechanics.