How are logistic loss and cross-entropy related?

I have read that Kullback-Leibler divergence, log loss, and cross-entropy all refer to the same loss function. Is the logistic loss used in logistic regression equivalent to the cross-entropy function? If so, can anybody explain how they are related?

Thanks


The relationship between cross-entropy, logistic loss, and K-L divergence is quite natural and follows directly from the definitions.

Cross-entropy is defined as: \begin{equation} H(p, q) = \operatorname{E}_p[-\log q] = H(p) + D_{\mathrm{KL}}(p \| q)=-\sum_x p(x)\log q(x) \end{equation} where $p$ and $q$ are two distributions, $H(p)$ is the entropy of $p$, and the second equality uses the definition of the K-L divergence. Now if $p$ is a Bernoulli distribution with probabilities $\{y, 1-y\}$ and $q$ a Bernoulli distribution with probabilities $\{\hat{y}, 1-\hat{y}\}$, we can rewrite the cross-entropy as: \begin{equation} H(p, q) = -\sum_x p_x \log q_x =-y\log \hat{y}-(1-y)\log (1-\hat{y}) \end{equation} which is nothing but the logistic loss. Further, the log loss is related to the logistic loss and the cross-entropy as follows:
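To make the identity concrete, here is a minimal NumPy sketch (the helper name `binary_cross_entropy` and the numbers are just for illustration): for a single label $y$ and predicted probability $\hat{y}$, the Bernoulli cross-entropy evaluates to exactly the logistic-loss expression above.

```python
import numpy as np

def binary_cross_entropy(y, y_hat):
    """Cross-entropy H(p, q) for Bernoulli p = {y, 1-y} and q = {y_hat, 1-y_hat}."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Single example: true label y = 1, predicted probability y_hat = 0.8
y, y_hat = 1.0, 0.8
print(binary_cross_entropy(y, y_hat))  # 0.2231...
print(-np.log(y_hat))                  # same number: the logistic loss for y = 1
```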

The expected log loss is defined as: \begin{equation} \operatorname{E}_p[-\log q] \end{equation} Note that this is the loss function used in logistic regression, where $q$ is given by the sigmoid function. The excess risk of this loss is: \begin{equation} \operatorname{E}_p[\log p - \log q ]=\operatorname{E}_p\left[\log\frac{p}{q}\right]=D_{\mathrm{KL}}(p\|q) \end{equation} Notice that the K-L divergence is nothing but the excess risk of the log loss, and that it differs from the cross-entropy only by the term $H(p)$, which is constant with respect to $q$ (see the first definition). One important thing to remember is that in logistic regression we usually minimize the log loss rather than the cross-entropy, which is not exactly the same quantity, but in practice it is fine.
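As a small numerical check of the decomposition $H(p,q) = H(p) + D_{\mathrm{KL}}(p\|q)$, here is a NumPy sketch for two Bernoulli distributions (the helper names and the values of $p$ and $q$ are made up for illustration):

```python
import numpy as np

def entropy(p):
    """Entropy H(p) of a Bernoulli(p) distribution."""
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def cross_entropy(p, q):
    """Cross-entropy H(p, q) between Bernoulli(p) and Bernoulli(q)."""
    return -(p * np.log(q) + (1 - p) * np.log(1 - q))

def kl_divergence(p, q):
    """K-L divergence D_KL(p || q) between Bernoulli(p) and Bernoulli(q)."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

p, q = 0.7, 0.4
print(cross_entropy(p, q))               # 0.7947...
print(entropy(p) + kl_divergence(p, q))  # same number: H(p, q) = H(p) + D_KL(p || q)
```

The K-L divergence is the gap between the cross-entropy and the entropy of $p$, which is exactly the "excess risk" reading above.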


Yes, they are related.
The cross-entropy used in logistic regression is derived from the maximum-likelihood principle (or, equivalently, from minimizing $-\log(\text{likelihood})$). See section 28.2.1, Kullback-Leibler divergence:

> Suppose ν and µ are the distributions of two probability models, and ν ≪ µ. Then the cross-entropy is the expected negative log-likelihood of the model corresponding to ν, when the actual distribution is µ.
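To illustrate the maximum-likelihood view, here is a small NumPy sketch (the data `X`, labels `y`, and weights `w` are made-up values, not from any particular source): the negative log of the Bernoulli likelihood of a logistic-regression model equals the summed cross-entropy (log) loss on the same data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up toy data and a fixed weight vector, purely for illustration
X = np.array([[0.5, 1.2], [-1.0, 0.3], [2.0, -0.7]])
y = np.array([1.0, 0.0, 1.0])
w = np.array([0.8, -0.5])

y_hat = sigmoid(X @ w)  # predicted probabilities of the logistic model

# Negative log of the Bernoulli likelihood of the model on this data ...
likelihood = np.prod(y_hat ** y * (1 - y_hat) ** (1 - y))
neg_log_likelihood = -np.log(likelihood)

# ... equals the summed cross-entropy (log) loss over the same data
cross_entropy_loss = np.sum(-(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))

print(neg_log_likelihood, cross_entropy_loss)  # identical up to floating-point error
```

So maximizing the likelihood and minimizing the cross-entropy / log loss are the same optimization problem.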