What is a good solution for calculating an average where the sum of all values exceeds a double's limits?

Solution 1:

You can calculate the mean iteratively. This algorithm is simple and fast, it processes each value just once, and its variables never grow larger than the largest value in the set, so you won't get an overflow.

double mean(double[] ary) {
  double avg = 0;
  int t = 1;
  for (double x : ary) {
    // avg holds the mean of the first (t - 1) values;
    // folding in x makes it the mean of the first t values
    avg += (x - avg) / t;
    ++t;
  }
  return avg;
}

Inside the loop, avg is always the average of the values processed so far, so as long as all the values are finite you should not get an overflow.
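
For example (a quick sketch that assumes the mean method above is in scope), two values near Double.MAX_VALUE whose naive sum overflows are handled without trouble:

double[] big = { Double.MAX_VALUE, Double.MAX_VALUE / 2 };
System.out.println(big[0] + big[1]);  // Infinity - the naive sum overflows
System.out.println(mean(big));        // ~1.35E308 - the running mean never overflows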

Solution 2:

The very first question I'd like to ask you is this:

  • Do you know the number of values beforehand?

If not, then you have little choice but to sum, count, and divide to get the average. If Double can't handle that sum, then tough luck: you can't use Double, and you need to find a data type that can handle it.

If, on the other hand, you do know the number of values beforehand, you can look at what you're really doing and change how you do it, but keep the overall result.

The average of N values, stored in some collection A, is this:

A[0]   A[1]   A[2]   A[3]          A[N-2]   A[N-1]
---- + ---- + ---- + ---- + .... + ------ + ------
 N      N      N      N               N        N

To calculate subsets of this result, you can split the calculation into equally sized sets. For 3-valued sets (assuming the number of values is divisible by 3, otherwise you need a different divisor), that looks like this:

/ A[0]   A[1]   A[2] \   / A[3]   A[4]   A[5] \          / A[N-3]   A[N-2]   A[N-1] \
| ---- + ---- + ---- |   | ---- + ---- + ---- |          | ------ + ------ + ------ |
\  3      3      3   /   \  3      3      3   /          \    3        3        3   /
 --------------------- +  --------------------  + .... +  ---------------------------
          N                        N                                   N
         ---                      ---                                 ---
          3                        3                                   3

Note that you need equally sized sets; otherwise the last set, which does not have as many values as the sets before it, will distort the final result.

Consider the numbers 1-7 in sequence. If you pick a set size of 3, you'll get this result:

/ 1   2   3 \   / 4   5   6 \   / 7 \ 
| - + - + - | + | - + - + - | + | - |
\ 3   3   3 /   \ 3   3   3 /   \ 3 /
 -----------     -----------     ---
      y               y           y

which gives:

     2   5   7/3
     - + - + ---
     y   y    y

If y is 3 for all the sets, you get this:

     2   5   7/3
     - + - + ---
     3   3    3

which gives:

2*3   5*3    7
--- + --- + ---
 9     9     9

which is:

6   15   7
- + -- + -
9    9   9

which totals:

28
-- ~ 3.111...
 9

The average of 1-7 is 4. Obviously this won't work. Note that if you do the above exercise with the numbers 1, 2, 3, 4, 5, 6, 7, 0, 0 (note the two zeroes at the end there), then you'll get the above result.

In other words, if you can't split the number of values up into equally sized sets, the last set will be counted as though it has the same number of values as all the sets preceding it, but it will be padded with zeroes for all the missing values.

So, you need equally sized sets. Tough luck if your original input set consists of a prime number of values.

What I'm worried about here, though, is loss of precision. I'm not entirely sure Double will give you good enough precision in such a case, if it can't even hold the entire sum of the values to begin with.
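
Translated into Java, the equal-sized-set idea above could look roughly like this (a sketch only; chunkedMean is a made-up name, and it assumes the number of values divides evenly by the chosen set size):

double chunkedMean(double[] values, int setSize) {
  // Assumes values.length is divisible by setSize.
  int sets = values.length / setSize;
  double total = 0;
  for (int s = 0; s < sets; s++) {
    double setSum = 0;
    for (int i = 0; i < setSize; i++) {
      // Divide each value by the set size before summing,
      // so this partial sum never exceeds the largest value.
      setSum += values[s * setSize + i] / setSize;
    }
    // Each set's average contributes 1/sets of the final mean.
    total += setSum / sets;
  }
  return total;
}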

Solution 3:

Apart from the better approaches already suggested, you can use BigDecimal for your calculations. (Bear in mind that it is immutable.)
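
A minimal sketch of what that could look like (bigDecimalMean is just an illustrative name; MathContext.DECIMAL64 is one reasonable choice of precision and rounding):

import java.math.BigDecimal;
import java.math.MathContext;

double bigDecimalMean(double[] values) {
  // BigDecimal has arbitrary precision, so the running sum cannot overflow.
  BigDecimal sum = BigDecimal.ZERO;
  for (double x : values) {
    // add() returns a new object because BigDecimal is immutable.
    sum = sum.add(BigDecimal.valueOf(x));
  }
  BigDecimal mean = sum.divide(BigDecimal.valueOf(values.length), MathContext.DECIMAL64);
  return mean.doubleValue();
}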

Solution 4:

IMHO, the most robust way of solving your problem is:

  1. sort your set
  2. split it into groups of elements whose sum wouldn't overflow - since they are sorted, this is fast and easy
  3. sum each group and divide by the group size
  4. combine the groups' averages (possibly calling this same algorithm recursively) - be aware that if the groups are not equally sized, you'll have to weight them by their size

One nice thing about this approach is that it scales nicely if you have a really large number of elements to sum - and a large number of processors/machines to do the math.
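
A rough, single-threaded sketch of steps 2-4 (groupedMean and groupSize are made-up names; it assumes the input is already sorted and that each group's sum fits in a double):

double groupedMean(double[] sorted, int groupSize) {
  double mean = 0;  // running weighted mean of the groups processed so far
  long count = 0;   // number of values folded into mean so far
  for (int start = 0; start < sorted.length; start += groupSize) {
    int end = Math.min(start + groupSize, sorted.length);
    int size = end - start;
    // Step 3: sum within the group and divide by the group size.
    double groupSum = 0;
    for (int i = start; i < end; i++) {
      groupSum += sorted[i];
    }
    double groupAvg = groupSum / size;
    // Step 4: fold the group's average into the overall mean,
    // weighted by the group's size (the last group may be smaller).
    count += size;
    mean += (groupAvg - mean) * size / count;
  }
  return mean;
}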