What is the motivation for using cross-entropy to compare two probability vectors?
Define a "probability vector" to be a vector $p = (p_1,\ldots, p_K) \in \mathbb R^K$ whose components are nonnegative and which satisfies $\sum_{k=1}^K p_k = 1$. We can think of a probability vector as specifying a probability mass function (PMF) for a random variable with $K$ distinct possible values.
A straightforward and intuitive way to compare two vectors $p$ and $q$ in $\mathbb R^K$ is to compute the quantity $$ d(p,q) = \frac12 \| p - q \|_2^2, $$ which is small when $p$ is close to $q$. However, if $p$ and $q$ are probability vectors, I think it is somehow more natural to compare them using the "cross-entropy loss function" $\ell$ defined by
$$ \ell(p,q) = -\sum_{k=1}^K q_k \log(p_k). $$ (This function is only defined when all components of $p$ are nonzero.)
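For concreteness, here is a minimal NumPy sketch (the two example vectors are made up) that evaluates both comparison functions on a pair of probability vectors:

```python
import numpy as np

def squared_distance(p, q):
    """d(p, q) = (1/2) * ||p - q||_2^2."""
    return 0.5 * np.sum((p - q) ** 2)

def cross_entropy_loss(p, q):
    """l(p, q) = -sum_k q_k * log(p_k); requires every p_k > 0."""
    return -np.sum(q * np.log(p))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.6, 0.3, 0.1])
print(squared_distance(p, q))    # 0.01
print(cross_entropy_loss(p, q))  # ~0.927
```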
Question: What is the motivation for using the cross-entropy loss function when comparing probability vectors? Is there a viewpoint that makes it directly obvious that this is the "correct" thing to do?
Some additional background information:
This method of comparing probability vectors is fundamental in machine learning, because we have the following "recipe" for a classification algorithm which classifies objects into one of $K$ distinct classes. Suppose that we are given a list of training examples $x_i \in \mathbb R^n$ and corresponding one-hot encoded label vectors $y_i \in \mathbb R^K$. (So if the $i$th training example belongs to class $k$, then the $k$th component of the vector $y_i$ is $1$ and the other components are $0$.) Let $S: \mathbb R^K \to \mathbb R^K$ be the softmax function defined by $$ S(u) = \begin{bmatrix} \frac{e^{u_1}}{\sum_k e^{u_k}} \\ \vdots \\ \frac{e^{u_K}}{\sum_k e^{u_k}} \end{bmatrix}. $$ The softmax function is useful because it converts a vector in $\mathbb R^K$ into a probability vector. To develop a classification algorithm, we attempt to find a function $f: \mathbb R^n \to \mathbb R^K$ such that for each training example $x_i$ the probability vector $p_i = S(f(x_i))$ is close to $y_i$ in the sense that $\ell(p_i, y_i)$ is small. For example, $f$ might be a neural network with a particular architecture, and the parameter vector $\theta$ which contains the weights of the neural network is chosen to minimize $$ \sum_{i = 1}^N \ell(p_i, y_i), $$ where $N$ is the number of training examples. (Multiclass logistic regression is the especially simple case where $f$ is assumed to be affine: $f(x_i) = A x_i + b$.)
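As a concrete illustration of this recipe, here is a minimal NumPy sketch (the data, random seed, and affine parameters $A$, $b$ are made up) that computes $p_i = S(f(x_i))$ for a toy batch and sums the cross-entropy losses $\ell(p_i, y_i)$:

```python
import numpy as np

def softmax(u):
    """Row-wise softmax; subtracting the row max improves numerical stability."""
    z = np.exp(u - u.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def total_cross_entropy(P, Y):
    """sum_i l(p_i, y_i) = -sum_i sum_k Y[i, k] * log(P[i, k])."""
    return -np.sum(Y * np.log(P))

rng = np.random.default_rng(0)
N, n, K = 5, 3, 4                      # toy sizes: examples, features, classes
X = rng.normal(size=(N, n))            # training examples x_i
Y = np.eye(K)[rng.integers(0, K, N)]   # one-hot label vectors y_i
A = rng.normal(size=(K, n))            # parameters of the affine model
b = rng.normal(size=K)

P = softmax(X @ A.T + b)               # p_i = S(f(x_i)), one row per example
loss = total_cross_entropy(P, Y)       # the quantity minimized over A and b
print(loss)
```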
One way to discover the cross-entropy loss function is to go through the steps of using maximum likelihood estimation to estimate the parameter vector $\theta$ which specifies $f$ (assuming that $f$ is restricted to be a member of a certain parameterized family of functions, such as affine functions or neural networks with a particular architecture). The cross-entropy loss function just pops out of the MLE procedure. This is the approach that currently seems the most clear to me. There is also an information theory viewpoint.
Is there any simple way to recognize that the cross-entropy loss function is a "natural" way to compare probability vectors?
Let me try to answer with the following three-step reasoning process.
To measure probability value difference
Intuitively, what is the best way to measure the difference between two individual probability values?
The probability that a person dies in a car accident is about $\frac{1}{77}$, while the odds of being struck by lightning are about $\frac{1}{700,000}$. Their numerical difference (in the L2 sense) is around 1%. Do you consider the two events similarly likely? Most people would consider them very different: the first kind of event is rare but significant and worth paying attention to, while most people would not worry about the second in their daily lives.
Overall, the sun shines about 72% of the time in San Jose, and about 66% of the time on the sunny side (the bay side) of San Francisco. The two sunshine probabilities differ numerically by about 6%. Do you consider the difference significant? For some it might be; but for me, both places get plenty of sunshine, and there is little material difference.
The takeaway is that we should measure the difference between individual probability values not by subtraction, but by some quantity related to their ratio $\frac{p_k}{q_k}$.
But there are problems with using the raw ratio as the measurement. One problem is that it can vary wildly, especially for rare events. It is not uncommon to assess a certain probability as 1% one day and declare it to be 2% the next; taking a simple ratio of this probability to the probability of another event would make the measurement change by 100% between the two days. For this reason, the log of the ratio, $\log\left(\frac{p_k}{q_k}\right)$, is used to measure the difference between an individual pair of probability values.
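As a quick numerical illustration (the probabilities here are made up): doubling an estimate from 1% to 2% doubles its raw ratio to any fixed reference probability, while the corresponding log-ratio shifts only by the constant $\log 2 \approx 0.69$:

```python
import numpy as np

q_ref = 0.10                    # fixed reference probability (made up)
p_day1, p_day2 = 0.01, 0.02     # the same rare event assessed on two days

r1, r2 = p_day1 / q_ref, p_day2 / q_ref
print((r2 - r1) / r1)            # 1.0 -> the raw ratio jumps by 100%
print(np.log(r2) - np.log(r1))   # ~0.693 = log(2), an additive shift only
```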
To measure probability distribution difference
The goal of your question is to measure the difference between two probability distributions, not between two individual probability values. A probability distribution involves multiple probability values. To most people, it should make sense to first compute the difference at each probability value, and then take a weighted average of these differences, with the weights being the probabilities themselves, i.e. the terms $p_k \log\left(\frac{p_k}{q_k}\right)$, as the distance between the two distributions.
This leads to our first formula for measuring the difference between distributions: $$ D_{KL}(p \Vert q) = \sum_{k=1}^K p_k \log\left( \frac{p_k}{q_k} \right). $$ This measure, called the KL divergence (it is not a metric), is usually much better suited to comparing probability distributions than L1/L2 distances, especially in machine learning. I hope that by now you agree that the KL divergence is a natural measure of the difference between probability distributions.
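Here is a minimal NumPy sketch of this formula, on made-up distributions; it treats $0 \log 0$ as $0$ and assumes $q_k > 0$ wherever $p_k > 0$:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) = sum_k p_k * log(p_k / q_k), treating 0 * log(0) as 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0                   # terms with p_k = 0 contribute nothing
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

p = [0.7, 0.2, 0.1]
q = [0.6, 0.3, 0.1]
print(kl_divergence(p, q))   # ~0.027: the two distributions are fairly close
print(kl_divergence(p, p))   # 0.0: zero divergence from itself
```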
Finally, the cross-entropy measure
There are two technical facts one needs to be aware of.
First, the KL divergence and the cross-entropy are related by the following formula: $$ D_{KL}(p \Vert q) = H(p, q) - H(p), $$ where $H(p, q) = -\sum_{k=1}^K p_k \log(q_k)$ is the cross-entropy and $H(p) = -\sum_{k=1}^K p_k \log(p_k)$ is the entropy of $p$.
Second, in ML practice we usually pass the ground-truth label as the $p$ argument and the model's output as the $q$ argument, and in the majority of cases our training algorithms are based on gradient descent. If both of these assumptions hold (which is most likely), the $H(p)$ term is a constant that does not depend on the model parameters, so it does not affect training and can be discarded to save computation. In this case the cross-entropy $H(p, q)$ can be used in place of $D_{KL}(p \Vert q)$.
If these assumptions are violated, you need to abandon the cross-entropy formula and revert to the KL divergence.
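Here is a small NumPy sketch (with made-up vectors) that checks the identity from the previous step numerically, and shows that for a one-hot label $p$ the entropy $H(p)$ is zero, so the cross-entropy and the KL divergence coincide:

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_k p_k * log(p_k), treating 0 * log(0) as 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def cross_entropy(p, q):
    """H(p, q) = -sum_k p_k * log(q_k); assumes q_k > 0 wherever p_k > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(q[nz]))

def kl_divergence(p, q):
    """D_KL(p || q) = sum_k p_k * log(p_k / q_k), treating 0 * log(0) as 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

q = np.array([0.6, 0.3, 0.1])         # model output (made up)
p_soft = np.array([0.7, 0.2, 0.1])    # a generic distribution, H(p) > 0
p_onehot = np.array([1.0, 0.0, 0.0])  # a one-hot label, H(p) = 0

# The identity D_KL(p || q) = H(p, q) - H(p) holds numerically:
print(kl_divergence(p_soft, q), cross_entropy(p_soft, q) - entropy(p_soft))

# For a one-hot label the entropy term vanishes, so the two measures coincide:
print(entropy(p_onehot))                                       # 0.0
print(cross_entropy(p_onehot, q), kl_divergence(p_onehot, q))  # both ~0.511
```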
I think I can now end my wordy explanation. I hope it helps.
Here is a "maximum likelihood estimation" viewpoint which is simple and clear, and which does not require any knowledge of information theory.
Imagine a $K$-sided die whose faces are labeled with integers from $1$ to $K$. The die is biased so that when we roll it, the probability that the result is $k$ is $p_k$. However, person $Q$ believes that the probability that the result is $k$ is $q_k$ (for $k = 1, \ldots, K$).
We roll the die $N$ times, where $N$ is a large positive integer. Let $y_i$ be the result of the $i$th roll, and let $N_k$ be the number of times that the die lands on face $k$. Person $Q$ would say that the probability of observing this particular sequence of values $y_1, \ldots, y_N$ is $$ L = \prod_{k=1}^K q_k^{N_k}. $$ If $L$ is close to $1$, then person $Q$ is not very surprised by the results of our $N$ observations, so in that sense the probability vector $q = (q_1, \ldots, q_K)$ is consistent with the probability vector $p = (p_1, \ldots, p_K)$.
But note that $$ \frac{\log(L)}{N} = \sum_{k=1}^K \frac{N_k}{N} \log(q_k) \approx \sum_{k=1}^K p_k \log(q_k). $$ The approximation is good when $N$ is large. So the quantity $\sum_{k=1}^K p_k \log(q_k)$ can be used to measure the consistency of $p$ and $q$: the larger it is, the closer $L$ is to $1$, and the less surprised person $Q$ is by the results of our die rolls. The cross-entropy $-\sum_{k=1}^K p_k \log(q_k)$ is exactly the negative of this quantity, so minimizing the cross-entropy means maximizing the probability that person $Q$ assigns to the observed outcomes.
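To see the approximation in action, here is a small simulation sketch (the die probabilities and person $Q$'s beliefs are made up): it compares $\log(L)/N$ computed from actual rolls with the limit $\sum_k p_k \log(q_k)$, for one belief vector close to $p$ and one far from it:

```python
import numpy as np

rng = np.random.default_rng(0)

p = np.array([0.5, 0.3, 0.2])         # true probabilities of the K = 3 faces
q_good = np.array([0.45, 0.35, 0.2])  # person Q's beliefs, close to p
q_bad = np.array([0.1, 0.1, 0.8])     # person Q's beliefs, far from p

N = 100_000
rolls = rng.choice(len(p), size=N, p=p)        # y_1, ..., y_N
counts = np.bincount(rolls, minlength=len(p))  # N_k for each face

for q in (q_good, q_bad):
    log_L_over_N = np.sum(counts * np.log(q)) / N   # log(L)/N from the data
    limit = np.sum(p * np.log(q))                   # sum_k p_k log(q_k)
    print(log_L_over_N, limit)  # nearly equal; larger (less negative) for q_good
```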