How can I calculate "most popular" more accurately?

Solution 1:

A standard procedure (frequently, and loosely, called a 'Bayesian average') is to take a weighted average of the individual rating and the 'a priori' rating:

$R_a = W \; R + (1 - W ) \; R_0$

where

$R_a = $ averaged ('Bayesian') rating

$R = $ individual rating: average rating for this item.

$R_0 = $ a priori rating: global average rating, for all items in your database.

$W = $ weight factor: it should tend to $0$ if this item has few votes, and tend to $1$ if it has many.

Some choices: $W = \frac{n}{N_{max}}$, or $W = \min\left( \alpha \frac{n}{N_{av}},\, 1\right)$, etc. ($n =$ number of votes for this item, $N_{max} =$ maximum number of votes over all items, $N_{av} =$ average number of votes per item, $\alpha =$ some number between $0.5$ and $1$...). Also, one frequently discards items with very low or very high values when computing these statistics.
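For concreteness, here is a minimal sketch of this weighted average in Python; the function and argument names are my own, and it uses the $W = \min(\alpha\, n / N_{av},\, 1)$ choice above:

```python
def bayesian_average(item_mean, item_votes, global_mean, avg_votes, alpha=0.75):
    """Blend an item's own mean rating R with the global a priori rating R0.

    item_mean   (R)    : average rating for this item
    item_votes  (n)    : number of votes for this item
    global_mean (R0)   : average rating over all items
    avg_votes   (N_av) : average number of votes per item
    alpha              : tuning constant, roughly between 0.5 and 1
    """
    w = min(alpha * item_votes / avg_votes, 1.0)  # weight tends to 1 with many votes
    return w * item_mean + (1.0 - w) * global_mean

# An item rated 5.0 from only 2 votes is pulled strongly toward the global mean of 3.6:
print(bayesian_average(item_mean=5.0, item_votes=2, global_mean=3.6, avg_votes=20))
```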

See some examples

Added: for another approach, especially for yes/no (like/dislike) votes, see here.

Solution 2:

You could consider each post to have a 'true' average rating which you zero in on as more users vote on it. If we consider that the votes on each post come from a set of all possible votes that could have been cast that have a 'true' mean $\mu$ and standard deviation $\sigma$, then your average of the votes actually cast can be considered as an estimator of the true value. This estimator can be more or less accurate, depending on the number of votes cast. So your question is "how do I take account of the inaccuracies of my measured averages in the ranking of posts?"

We can solve this with a little math. Denote vote $i$ by $x_i \in \{1,\dots,5\}$. Start by giving each post a rating of $3$, so $x_1=3$. Then after each subsequent rating calculate the mean

$$\mu_{(n)} = \frac{1}{n}\sum_{i=1}^n x_i$$

and the sample standard deviation

$$\sigma_{(n)} = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \mu_{(n)})^2}$$

Then the standard deviation of your measured mean $\mu_{(n)}$ (which is a measure of its accuracy) is given by

$$\hat{\sigma}_{(n)} = \frac{\sigma_{(n)}}{\sqrt{n}}$$

i.e. the quality of your estimate of the true mean increases as the number of votes increases, which is what you'd expect.

To translate this into a ranking, note that knowing the standard deviation of the mean gives you an approximate confidence interval for the mean. In particular, you can calculate a lower bound for the true mean at a given confidence level. A 95% lower confidence bound is given by $\mu_{(n)} - 1.64\hat{\sigma}_{(n)}$; loosely translated, you can be 95% sure that the true mean lies above $\mu_{(n)} - 1.64\hat{\sigma}_{(n)}$.

Ranking your posts by this quantity rather than by the measured mean naturally takes into account your uncertainty for posts with only a small number of votes.
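As a sketch of this ranking rule in Python (the post names and vote lists are made up; the single-vote fallback mirrors the initial rating of $3$ mentioned above):

```python
from math import sqrt
from statistics import mean, stdev

def lower_bound_score(votes, z=1.64):
    """Rank posts by (sample mean) - z * (standard deviation of the mean)."""
    n = len(votes)
    if n < 2:                       # sample std needs at least 2 votes; fall back
        return votes[0] if votes else 3.0
    se = stdev(votes) / sqrt(n)     # standard deviation of the measured mean
    return mean(votes) - z * se     # ~95% confident the true mean is above this

posts = {
    "A": [5, 5, 4],                               # higher raw mean, few votes
    "B": [4, 5, 4, 4, 5, 4, 4, 5, 5, 4, 5, 4],    # slightly lower mean, many votes
}
for name, votes in sorted(posts.items(), key=lambda kv: -lower_bound_score(kv[1])):
    print(name, round(mean(votes), 3), round(lower_bound_score(votes), 3))
```

In this toy example post B outranks post A even though A's raw mean is higher, because B's mean is backed by many more votes.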


One valid criticism of this scheme is that it is expensive to recalculate the mean and variance every time someone submits a vote. You may find the following on-line algorithms for calculating mean and variance useful: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm
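For reference, here is a small sketch of the on-line (Welford) update described in that article; it keeps a running mean and variance per post in $O(1)$ work per vote, so nothing has to be recomputed from scratch. The class and method names are my own:

```python
class RunningStats:
    """On-line mean/variance (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0              # running sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):            # sample variance (n - 1 in the denominator)
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for vote in [3, 5, 4, 4, 5]:       # the initial 3 plus four user votes
    stats.add(vote)
print(stats.mean, stats.variance())    # 4.2 0.7
```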

Solution 3:

While Chris Taylor's answer is already good, I feel I should say something about statistical estimation. The idea is that your data are observations of a random variable $X$, and you want to say something about the distribution of $X$ using this data. There are two main types of models in statistics, namely parametric and non-parametric models. For example, plotting your data as a histogram is a non-parametric method. While non-parametric methods are robust, it is sometimes more difficult to obtain information about $X$ using them, so we often make assumptions about the distribution of $X$ and use a parametric model instead. An example of parametric modelling would be to assume that the users' votes are normally distributed with some unknown mean $\mu$ and variance $\sigma^2$, and to estimate $\mu$ and $\sigma^2$ from the sample data. One advantage of making such an assumption is that we can give good confidence intervals for our estimates.

Indeed, if we have samples $X_1, \ldots, X_n$ of $X \sim N(\mu, \sigma^2)$, then the sample mean $\overline{X} = \frac{1}{n} (X_1 + \cdots + X_n)$ has standard deviation $\sigma/\sqrt{n}$, so the probability that $\overline{X}$ is contained in the interval $(\mu - a, \mu + a)$ is $$\Phi \left(\frac{a \sqrt{n}}{\sigma}\right) - \Phi \left(-\frac{a \sqrt{n}}{\sigma}\right)$$ where $\Phi$ is the cumulative distribution function of the standard normal distribution. If we set $a = 1.64\, \sigma / \sqrt{n}$ then we see that there is a $90\%$ chance that $\overline{X} \in (\mu - a, \mu + a)$. Equivalently, there is a $90\%$ chance that $\mu \in (\overline{X} - a, \overline{X} + a)$.

Unfortunately, I've cheated here a little: we don't know what $\sigma^2$ is, so we can't calculate $a$. So now we have to estimate $\sigma^2$. It's a known fact that $$\frac{1}{\sigma^2} \sum_{i=1}^{n} (X_i - \overline{X})^2$$ is a $\chi^2$ random variable with $n - 1$ degrees of freedom, independent of $\overline{X}$. Hence, $$\mathbb{E} \left[ \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \overline{X})^2 \right] = \sigma^2,$$ i.e. $\tilde{\sigma}^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \overline{X})^2$ is an unbiased estimator of $\sigma^2$. We could, in principle, compute a confidence interval for $\sigma^2$ as well, but we only need a point estimate, and we can simply substitute $\tilde{\sigma}$ for $\sigma$ in the formula above to obtain a concrete confidence interval for $\mu$. (But note that it is then no longer an exact $90\%$ confidence interval!)
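As a small numeric illustration (with made-up votes), plugging the estimate $\tilde{\sigma}$ into the interval above looks like this:

```python
from math import sqrt
from statistics import mean, stdev

votes = [4, 5, 3, 4, 5, 4, 2, 5, 4, 4]      # hypothetical sample of user ratings
n = len(votes)
x_bar = mean(votes)                          # sample mean, the estimate of mu
sigma_tilde = stdev(votes)                   # square root of the unbiased variance estimate
a = 1.64 * sigma_tilde / sqrt(n)             # half-width using the normal 90% quantile
print(f"approximate 90% interval for mu: ({x_bar - a:.2f}, {x_bar + a:.2f})")
```

(Using a $t$-quantile with $n-1$ degrees of freedom instead of $1.64$ would give an exact interval, which is exactly where the 'cheating' above shows up.)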

Of course, one can iterate this game, and (loosely speaking) Bayesian statistics is what you get when you assume that your parameters themselves are random variables distributed according to some prior distribution.