Solution 1:

Estimating entropy is not an easy problem and has been a subject of research for years.

  1. There is no unbiased estimator for entropy [Paninski 2003].
  2. There are plenty of good entropy estimators that have low bias and/or low variance.

Here's a partial list of the estimators I think are good:

  • Paninski. Estimation of Entropy and Mutual Information. Neural Computation, Vol. 15, No. 6 (2003), pp. 1191-1253.
  • Vincent Q. Vu, Bin Yu, Robert E. Kass. Coverage-adjusted entropy estimation. Statistics in Medicine, Vol. 26, No. 21 (2007), pp. 4039-4060, doi:10.1002/sim.2942.
  • Ilya Nemenman, Fariel Shafee, William Bialek. Entropy and inference, revisited. PRE (2002).
  • Evan Archer, Il Memming Park, Jonathan Pillow. Bayesian Entropy Estimation for Countable Discrete Distributions (arXiv). (Disclaimer: this is my paper.)
  • Valiant and Valiant. Estimating the Unseen: Improved Estimators for Entropy and other Properties. NIPS 2013 (link).

For a uniform distribution, the coverage-adjusted estimator (CAE) works very well, and Valiant & Valiant should work well too. A quick and dirty estimator for the uniform case is the Ma estimator; a sketch of the coincidence-counting idea behind it is given below. And here's my citeulike tag page for entropy estimation, in case you need more papers. :)
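
For concreteness, here is a small sketch of the coincidence-counting idea I associate with the Ma estimator; treat the details as my own reading rather than a canonical implementation. The idea is that for a roughly uniform distribution over $K$ states, two independent samples coincide with probability about $1/K$, so $K$, and hence the entropy $\ln K$, can be estimated from the fraction of coincident sample pairs.

```python
# Coincidence-counting sketch (my reading of the Ma idea, for illustration only).
from collections import Counter
from math import log


def ma_entropy_estimate(samples):
    n = len(samples)
    counts = Counter(samples)
    pairs = n * (n - 1) / 2                                   # total number of sample pairs
    coincidences = sum(c * (c - 1) / 2 for c in counts.values())  # pairs that land on the same value
    if coincidences == 0:
        raise ValueError("no coincidences observed; need more samples")
    k_hat = pairs / coincidences        # estimated number of (effective) states
    return log(k_hat)                   # entropy of a uniform distribution over k_hat states, in nats
```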

EDIT: I made a flowchart! Details in my blog.

Solution 2:

It seems that I need to give a more detailed answer in this post rather than only referring to the references for the main technical ideas. Let me start with a general picture.

From a statistical perspective, estimating the entropy is by no means a unique problem among problems of estimating functionals of parameters. The reason it has attracted so much attention is that it is important in practice, and also that we have a rather poor understanding of the general problem of functional estimation. We believe a complete and satisfactory solution to entropy estimation should be based on a general methodology for functional estimation.

Let's first consider functional estimation problems in parametric settings. In this setting, it has long been observed that in general we cannot obtain an estimator that satisfies some finite-sample optimality criterion (check the book by Lehmann and Casella for details), and we need to go to asymptotics, where we fix the parameter and let the sample size go to infinity. Interestingly, in this regime a very wide class of functional estimation problems becomes trivial, and the simple MLE plug-in approach is asymptotically efficient [check the book by van der Vaart on asymptotic statistics, Chapter 8]. Because of this, the statistics community has shifted its focus to nonparametric settings, i.e., estimating functionals of an infinite-dimensional parameter.
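
To make "MLE plug-in" concrete for the entropy case, here is a minimal sketch (Python is my choice purely for illustration): estimate the distribution by the empirical frequencies, which are the MLE for a discrete distribution, and plug them into the definition of entropy.

```python
# Minimal MLE plug-in entropy estimator (illustration of "plug-in", not a recommendation).
import numpy as np


def plugin_entropy(samples):
    _, counts = np.unique(samples, return_counts=True)
    p_hat = counts / counts.sum()             # empirical (maximum likelihood) distribution
    return -np.sum(p_hat * np.log(p_hat))     # H(p_hat) in nats
```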

The nonparametric functional estimation problem turns out to be extremely challenging. We have a fairly good understanding of linear functionals (Donoho and Liu '91, Donoho '94, etc.), a quite good understanding of quadratic and smooth functionals (Bickel and Ritov '88, Birge and Massart '95, etc.), but a very limited understanding of nonsmooth functionals. One breakthrough in the nonparametric functional estimation literature is the 1999 paper by Lepski, Nemirovski, and Spokoiny, "On estimating the $L_r$ norm of a regression function", in which they showed that a general method based on approximation can help construct nearly minimax optimal estimators for a class of functionals. This idea was later generalized and applied to estimating the $\ell_1$ norm of a Gaussian mean in Cai and Low '11.

Our work (http://arxiv.org/abs/1406.6956) aims to present a general methodology that can (hopefully) help construct minimax optimal estimators for any functional. The recipe is very easy to describe.

First, let us think about why the entropy estimation problem is hard. You may say it is because $H(P) = \sum_{i = 1}^S -p_i \ln p_i$ and the function $f(x) = -x \ln x$ is not differentiable at $x = 0$. If you think along this line, you are absolutely right and have captured the essential feature of the entropy functional. The fact that $-x \ln x$ is not smooth at $x = 0$ implies that, when the true $p_i$ is small, we had better have a very good estimate of it; when the true $p_i$ is large, the MLE of $p_i$ is perhaps already accurate enough to give a good estimate of $f(p_i)$. The solution follows from this idea:

We first compute the empirical distribution. When we find that some entry of the empirical distribution is significantly less than $\ln n/n$, which means the corresponding symbol appears very few times, we construct a polynomial that approximates the function $f(x) = -x\ln x$ on the interval $[0, \ln n/n]$ as well as possible, and directly estimate this polynomial instead of the original function. When an entry of the empirical distribution is large enough, we simply plug it into the definition of entropy and do a first-order bias correction. This scheme can be applied to any functional; a rough sketch is given below.
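
To make the recipe concrete, here is a rough sketch in code. This is my own illustration, not the authors' implementation (for that, see http://arxiv.org/abs/1406.6956): the threshold $\ln n/n$, the choice of polynomial degree, the least-squares fit used as a stand-in for the best (minimax) polynomial approximation, and the $(1-\hat p)/(2n)$ first-order bias correction are all simplifying assumptions on my part.

```python
# Rough sketch of the "approximate, then estimate" recipe described above.
# Illustration only; NOT the authors' implementation.
import numpy as np


def entropy_estimate(samples):
    samples = np.asarray(samples)
    n = samples.size
    counts = np.unique(samples, return_counts=True)[1]

    threshold = np.log(n) / n                        # split "small" vs. "large" empirical probabilities
    degree = min(10, max(1, int(np.log(n))))         # degree grows slowly with n (capped for stability)

    # Approximate f(x) = -x ln x on [0, threshold] by a polynomial. For numerical
    # stability, fit g(t) = f(threshold * t) on t in (0, 1], so that
    # f(x) ~ sum_k b[k] * (x / threshold)^k.
    t = np.linspace(1e-6, 1.0, 400)
    g = -(threshold * t) * np.log(threshold * t)
    b = np.polyfit(t, g, degree)[::-1]               # b[k] multiplies t^k

    H = 0.0
    for x in counts:
        p_hat = x / n
        if p_hat < threshold:
            # Small-probability regime: estimate each monomial p^k without bias,
            # using E[x(x-1)...(x-k+1)] = n(n-1)...(n-k+1) * p^k for x ~ Binomial(n, p).
            for k, bk in enumerate(b):
                num = np.prod(np.arange(x, x - k, -1, dtype=float))   # x(x-1)...(x-k+1)
                den = np.prod(np.arange(n, n - k, -1, dtype=float))   # n(n-1)...(n-k+1)
                H += bk * (num / den) / threshold ** k
        else:
            # Large-probability regime: plug-in with a first-order bias correction.
            H += -p_hat * np.log(p_hat) + (1 - p_hat) / (2 * n)
    return H
```

The key point is that monomials $p^k$ admit unbiased estimates from the counts (via falling factorials), which is exactly why replacing $-x \ln x$ by a polynomial removes the bias problem in the small-probability regime.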

In retrospect, this recipe is precisely the dual of the shrinkage idea originally proposed by Stein '56. In shrinkage, we improve on the MLE by introducing bias to significantly reduce the variance. Here, we improve on the MLE by introducing variance to significantly reduce the bias. Shrinkage is usually helpful in cases where the variance is the dominating term, and our recipe is usually helpful in cases where the bias is the dominating term.

We support this recipe with strong theoretical guarantees: we show that it achieves the minimax rates for $H(P)$ and for $F_\alpha(P) = \sum_{i = 1}^S p_i^\alpha$ for any $\alpha > 0$. It is the first recipe in the literature that achieves the minimax rates for all these functionals, and we have also shown that the MLE cannot achieve the minimax rates.

We are in fact more ambitious than just obtaining the minimax rates: we want something that actually works in practice, something our applied scientist friends can directly apply in their experiments. In the experiments section of our work (http://arxiv.org/abs/1406.6956), we show that this new estimator for entropy reduces the mean squared error significantly and has linear computational complexity.

There is other ongoing work demonstrating its efficacy in various statistical problems, and we hope applied scientists can soon apply these tools in their work and research.

Solution 3:

We have something exciting to share: recently we constructed a new class of estimators for entropy (and, in fact, for nearly every functional of distributions) that have provable optimal performance guarantees and linear complexity. In particular, our estimator for entropy also achieves the optimal $n/\log n$ scaling achieved by Valiant and Valiant, and it demonstrates superior empirical performance in various experiments. For details please check the following two papers:

http://arxiv.org/abs/1406.6956

http://arxiv.org/abs/1406.6959

Edit: we also remark that our estimators are agnostic to the alphabet size, which is usually unknown to the statistician. In fact, existing approaches usually require knowledge of, or at least an upper bound on, the alphabet size, and even with this knowledge they generally cannot achieve the optimal $n/\log n$ scaling.

Solution 4:

This answer does not contradict Memming's answer, but I think it is worth mentioning:

There is an unbiased estimator for entropy, as discussed in this paper. But for the estimator to work, two conditions must hold:

  1. You must know the number of classes (= values) the random variable can have. (The number of classes must be finite.)
  2. Your set of samples must contain at least one sample of each class.

Given that these two conditions hold, the following is an unbiased estimator of the entropy: $$\sum_{i=1}^{M}\frac{I_{N_i\geq2}}{N_i-1}$$ where $M$ is the number of classes, $N_i$ is the index of the first occurrence of a sample of class $i$ (here we assume the samples are ordered and the classes are labeled $1,\dots,M$), and $I_{\bullet}$ is the indicator function that equals $1$ if the condition $\bullet$ holds and $0$ otherwise. The proof can be found in the linked paper.

Note that in principle this estimator requires an unlimited supply of samples, so that you can ensure the second condition holds with probability $1$.
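
For what it's worth, here is a direct transcription of the formula above into code (this is just my own bookkeeping of the definition, not taken from the linked paper); it assumes every class eventually appears in the sample, per condition 2:

```python
def first_occurrence_entropy_estimate(samples):
    """Transcription of sum_i I(N_i >= 2) / (N_i - 1), with N_i the 1-based
    index of the first sample of class i in the ordered sample sequence."""
    first_seen = {}                                  # class -> index of its first occurrence
    for idx, x in enumerate(samples, start=1):
        first_seen.setdefault(x, idx)
    return sum(1.0 / (n_i - 1) for n_i in first_seen.values() if n_i >= 2)
```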

EDIT: I guess the right property of this estimator would be "consistent" rather than "unbiased", since it requires infinitely many samples.