Solution 1:

Let the total number of events be $K$ and let $X_1, X_2, \ldots, X_N$ be a random sample, where $X_i = j$ means that sample $X_i$ corresponds to event $j$, and let the probability of event $j$ be $p_j$ (note that $j\leq K$). The estimate of $p_j$, denoted $\hat{p_j}$, is given by

$\hat{p_j} = \frac{\sum\limits_{i=1}^{N}\mathbb{I}(X_i=j)}{N}$, where $\mathbb{I}$ is the indicator function.
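To make the estimator concrete, here is a minimal Python sketch of the plug-in (empirical-frequency) entropy estimate; the function name and the example sample are illustrative assumptions, not part of the original question.

```python
import numpy as np

def plugin_entropy(x, K):
    """Plug-in entropy estimate g(T) in nats.

    x : 1-D integer array of observed event labels in {0, ..., K-1}
    K : total number of possible events
    """
    N = len(x)
    # \hat{p}_j = (number of samples equal to j) / N
    p_hat = np.bincount(x, minlength=K) / N
    # g(T) = sum_j p_hat_j * log(1 / p_hat_j); empty categories contribute 0
    nz = p_hat > 0
    return -np.sum(p_hat[nz] * np.log(p_hat[nz]))

# Example: estimate the entropy of a known 3-event distribution
rng = np.random.default_rng(0)
p_true = np.array([0.5, 0.3, 0.2])
x = rng.choice(len(p_true), size=10_000, p=p_true)
print(plugin_entropy(x, K=len(p_true)))  # close to -sum(p*log p) ≈ 1.03 nats
```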

Let us denote $\vec{T}=(\hat{p_1}, \hat{p_2}, \ldots, \hat{p_K})$, $\vec{\theta} = (p_1, p_2, \ldots, p_K)$ and $g(\vec{x}) = \sum\limits_{i=1}^{K}x_i\log\frac{1}{x_i}$, so that $g(\vec{\theta})$ is the true entropy and $g(\vec{T})$ is its plug-in estimate. Then, using the first-order Taylor approximation, we get,

$g(\vec{T}) \approx g(\vec{\theta}) + \sum\limits_{i=1}^{K}g^{'}_{i}(\vec{\theta})(T_i-\theta_i)$
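Here $g^{'}_{i}$ denotes the partial derivative of $g$ with respect to its $i$-th argument, evaluated at $\vec{\theta}$; for the entropy function defined above it works out to

$$g^{'}_{i}(\vec{\theta}) = \frac{\partial}{\partial x_i}\left(\sum\limits_{k=1}^{K}x_k\log\frac{1}{x_k}\right)\Bigg|_{\vec{x}=\vec{\theta}} = \log\frac{1}{\theta_i} - 1.$$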

and therefore,

$E(g(\vec{T})) \approx g(\vec{\theta})$, because $E(T_i) = \theta_i$

$Var(g(\vec{T})) \approx E((g(\vec{T})-g(\vec{\theta}))^2) \\ \approx E\left(\left(\sum\limits_{i=1}^{K}g^{'}_{i}(\vec{\theta})(T_i-\theta_i)\right)^2\right)\\ = \sum\limits_{i=1}^{K}(g^{'}_{i}(\vec{\theta}))^2Var(T_i) + 2\sum\limits_{i>j}g^{'}_{i}(\vec{\theta})g^{'}_{j}(\vec{\theta})Cov(T_i,T_j)$

Since $NT_i \sim \mathrm{Binomial}(N, p_i)$, we have $Var(T_i) = \frac{p_i(1-p_i)}{N}$. Assuming zero covariance between $T_i$ and $T_j$, we get,

$Var(g(\vec{T})) \approx \sum\limits_{i=1}^{K}(g^{'}_{i}(\vec{\theta}))^2Var(T_i) \\ = \sum\limits_{i=1}^{K}\left(\log\frac{1}{p_i}-1\right)^2\cdot\frac{p_{i}(1-p_{i})}{N} \\ = \sum\limits_{i=1}^{K}(\log(p_i)+1)^2\cdot\frac{p_{i}(1-p_{i})}{N}$

Now, to get an estimate of the above variance, substitute $\hat{p_i}$ for $p_i$. Note that this method gives only an approximate value of the variance of the entropy estimate. To compute a confidence interval, one also needs the distribution of $g(\vec{T})$. If the sample size is very large, i.e. $N \rightarrow \infty$, then by the multivariate delta method (the transformation-based extension of the central limit theorem) one can show that, in our case,

$\sqrt{N}(g(\vec{T}) - g(\vec{\theta})) \rightarrow \mathcal{N}\left(0,\ \sum\limits_{i=1}^{K}(\log(p_i)+1)^2\cdot p_{i}(1-p_{i})\right)$

and thereafter it is easy to compute any confidence interval for the true entropy $g(\vec{\theta})$, centered at $g(\vec{T})$.
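As a rough sketch of how these pieces fit together in practice, the following Python snippet plugs the empirical frequencies into the variance formula above (with the zero-covariance approximation) and forms an approximate 95% normal confidence interval; all names and the example data are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def entropy_ci(x, K, level=0.95):
    """Plug-in entropy (nats) with an approximate delta-method CI,
    using the zero-covariance approximation from the derivation above."""
    N = len(x)
    p_hat = np.bincount(x, minlength=K) / N
    nz = p_hat > 0
    h_hat = -np.sum(p_hat[nz] * np.log(p_hat[nz]))          # g(T)
    # Estimated Var(g(T)) ~= sum_i (log p_i + 1)^2 p_i (1 - p_i) / N, with p_i -> p_hat_i
    var_hat = np.sum((np.log(p_hat[nz]) + 1) ** 2 * p_hat[nz] * (1 - p_hat[nz])) / N
    z = norm.ppf(0.5 + level / 2)                            # e.g. 1.96 for 95%
    return h_hat, (h_hat - z * np.sqrt(var_hat), h_hat + z * np.sqrt(var_hat))

rng = np.random.default_rng(1)
p_true = np.array([0.5, 0.3, 0.2])
x = rng.choice(len(p_true), size=10_000, p=p_true)
print(entropy_ci(x, K=len(p_true)))
```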

Let me know if something is not clear.

Solution 2:

Basharin answered this (implicitly) in "On a Statistical Estimate for the Entropy of a Sequence of Independent Random Variables". Teor Veroyatnost i Primenen. 1959;4(3):361–364. English version at https://epubs.siam.org/doi/10.1137/1104033.

He calculated the bias and variance of the plug-in estimator of Shannon entropy given above, and showed that the estimator is consistent and asymptotically normal. Thus, for a large enough data set, one should be able to use approximate Gaussian confidence intervals. Basharin's expression for the variance of $\hat{H}$ is $${\mbox{Var}}\left[\hat{H}\right] = \frac{1}{N} \left[ \sum_{i=1}^s p_i \big( \log_2 p_i \big)^2 - H^2 \right] + O\left( \frac{1}{N^2} \right). $$
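For completeness, here is a small Python sketch that evaluates the leading $\frac{1}{N}$ term of this variance (in bits) with the empirical frequencies plugged in; the function name and example numbers are illustrative, not from the paper.

```python
import numpy as np

def basharin_variance(p_hat, N):
    """Leading-order (1/N) term of Basharin's Var[H-hat], in bits^2;
    the O(1/N^2) remainder is ignored."""
    p = p_hat[p_hat > 0]
    H = -np.sum(p * np.log2(p))                    # plug-in entropy in bits
    return (np.sum(p * np.log2(p) ** 2) - H ** 2) / N

# Example: empirical frequencies from a sample of size N
N = 10_000
p_hat = np.array([0.48, 0.31, 0.21])
print(basharin_variance(p_hat, N))
```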

Basharin also calculates the bias, which is important for smaller samples; in those cases, however, the Gaussian approximation may be poor.