Why does the Elo rating system work?
The Elo rating system is used to rank players in games such as chess. I can find plenty of explanations online of how to compute someone's Elo rating, how to actually crunch the numbers in practice, but I can't find a single clear conceptual explanation of what the rating is supposed to mean and why.
The only information I can find is that apparently the Elo rating of two players allows you to calculate the odds that one player will win against the other. But every page I've been able to find that talks about this just drops the formula for how to calculate these odds on you and says "there you go, that gives the probability of winning", without explaining why. Wikipedia mentions something about the assumption that "chess performance is normally distributed", but doesn't go any further.
What is the underlying probabilistic model for two-player games that the Elo system is based on? What are its basic assumptions, and what is the proof, from those assumptions, that the Elo system does indeed allow you to calculate win probabilities?
The key point about the Elo rating is that it is related to the log-odds of players winning games.
It assumes that there is a relationship across players, so that (ignoring the possibility of draws) if Player B is $10$ times as likely to beat Player A as Player A is to be beat Player $B$, and Player C is $10$ times as likely to beat Player B as Player B is to beat Player C, then Player C is $100$ times as likely to beat Player A as Player A is to beat Player C.
The Elo rating is scaled so that (ignoring the possibility of draws) if Player B is $10$ times as likely to beat Player A as Player A is to beat Player B then the Elo rating of Player B should be $400$ higher than the Elo rating of Player A. Combining this with the earlier assumption has the result that, if Player C is $100$ times as likely to beat Player A as Player A is to beat Player C, then the Elo rating of Player C should be $800$ higher than the Elo rating of Player A: each linear increase in the difference of Elo ratings of $400$ multiplies the odds of the better player winning by a factor of $10$, so this is a logarithmic relationship.
Putting these together means that the prediction based on Elo ratings $R_A$ and $R_B$ gives $$400 \log_{10}(\text{Odds}(\text{B beats A})) = {R_B-R_A} $$ and that implies $$\text{Odds}(\text{B beats A}) = \dfrac{\Pr(\text{B beats A})}{\Pr(\text{A beats B})} = 10^{(R_B-R_A)/400} $$ and combining these with ${\Pr(\text{B beats A})}+{\Pr(\text{A beats B})}=1$ would give a probability prediction $$\Pr(\text{B beats A}) = \dfrac{10^{(R_B-R_A)/400}}{10^{(R_B-R_A)/400}+1} =\dfrac{1}{1+10^{(R_A-R_B)/400}}$$ and a predicted expected net result for Player B of $$\Pr(\text{B beats A}) - \Pr(\text{A beats B}) = \dfrac{10^{(R_B-R_A)/400}-1}{10^{(R_B-R_A)/400}+1} =\dfrac{1-10^{(R_A-R_B)/400}}{1+10^{(R_A-R_B)/400}}$$
The Elo score then has two further useful features: first a mechanism for adjusting scores when results are not as expected (and a $K$ factor which attempts to balance the desire that incorrect scores should adjust as quickly as possible against a desire not to have too much volatility in scores); and second a method to address competitions which are not just win-lose, by focussing on expected net results from a contest rather than just the odds and probabilities of wins and losses.
Here are two very interesting articles from Mr. Mark Glickman, who is a statistic professor at Harvard University. I think it answers to your questions:
http://glicko.net/research/chance.pdf
http://www.glicko.net/research/acjpaper.pdf
A couple of minor points augmenting some of the above answers.
... and what is the proof, from those assumptions, that the Elo system does indeed allow you to calculate win probabilities?
what is the proof that a $\chi$ or Lorenzian or Gaussian distribution allow you to calculate favorable probabilities?
There is no proof. You assume that your probabilities are given by the given distribution (in Henry's answer) and then (and after your distribution model is "sufficiently validated" by actual results), you just make the same simple test that you do in Statistics 101.
What "sufficiently validated" means: That you first choose your model to be some common distribution like the Normal one and then check this by recording the distribution of the play results, using the actual ELO points of the players.
If a "skewness" (or bias) results between the distribution of the actual results and the assumed Normal, this very bias will show you how to adjust your old distribution to get one that will be closer to experimental results (Your comment).
So you iterate and refine. Each time you iterate, the new bias will point closer to a better distribution you should use for your next iteration.
The current ELO distribution then is nothing more than a finitely-timed approximant of the distribution results you get from actual trial runs with players with actual ELO ratings, for some small bias $\epsilon>0$.
Now assuming the whatever ELO approximant, your test is as simple as putting a vertical bar on the graph of the distribution. I.e., with:
restart;
with(plots):
AR:=2500;
AEF := proc (x) options operator, arrow;
1/(1+10^((1/400)*x-(1/400)*AR)) end proc;
plot(AEF(x), x = 1000 .. 3500, color = red);
The curve below is the prediction for a x ELO player losing to a 2500 ELO player. The distribution shows you your chances of losing.
Some of the advantages of the ELO model with chess
The good thing with this distribution (as with any distribution that models sufficiently accurately some phenomenon) is that it tells you many interesting things about the game it models just by looking at it:
First of all, you notice that it is a very "dense" distribution. Like the Fermi distribution of degenerate electrons in the core of a Neutron star. It has an abrupt fallout and its appeareance indicates a fairly large subset base (large compact support), where the distribution remains almost constant.
If you interpret some of the above features correctly (looking at the corresponding items like expectation, momentum, etc.), you can the make some fairly solid probabilistic statements, some of which may prove to be true.
For example, the particular abrupt fallout, tells you that much more effort is required to cover the distance between 2000 and 2700, than to cover the distance between 1000 and 2000 (with integrals from a to b as work, etc)
The fairly large compact support on the other hand (1000-2000), reveals something which is, surprisingly true. That the really good players are relatively rare :*)