Standard Deviation in LINQ

Does LINQ model the aggregate SQL function STDDEV() (standard deviation)?

If not, what is the simplest / best-practices way to calculate it?

Example:

  SELECT test_id, AVERAGE(result) avg, STDDEV(result) std 
    FROM tests
GROUP BY test_id

Solution 1:

You can make your own extension calculating it

public static class Extensions
{
    public static double StdDev(this IEnumerable<double> values)
    {
       double ret = 0;
       int count = values.Count();
       if (count  > 1)
       {
          //Compute the Average
          double avg = values.Average();

          //Perform the Sum of (value-avg)^2
          double sum = values.Sum(d => (d - avg) * (d - avg));

          //Put it all together
          ret = Math.Sqrt(sum / count);
       }
       return ret;
    }
}

If you have a sample of the population rather than the whole population, then you should use ret = Math.Sqrt(sum / (count - 1));.

Transformed into extension from Adding Standard Deviation to LINQ by Chris Bennett.

Solution 2:

Dynami's answer works but makes multiple passes through the data to get a result. This is a single pass method that calculates the sample standard deviation:

public static double StdDev(this IEnumerable<double> values)
{
    // ref: http://warrenseen.com/blog/2006/03/13/how-to-calculate-standard-deviation/
    double mean = 0.0;
    double sum = 0.0;
    double stdDev = 0.0;
    int n = 0;
    foreach (double val in values)
    {
        n++;
        double delta = val - mean;
        mean += delta / n;
        sum += delta * (val - mean);
    }
    if (1 < n)
        stdDev = Math.Sqrt(sum / (n - 1));

    return stdDev;
}

This is the sample standard deviation since it divides by n - 1. For the normal standard deviation you need to divide by n instead.

This uses Welford's method which has higher numerical accuracy compared to the Average(x^2)-Average(x)^2 method.

Solution 3:

This converts David Clarke's answer into an extension that follows the same form as the other aggregate LINQ functions like Average.

Usage would be: var stdev = data.StdDev(o => o.number)

public static class Extensions
{
    public static double StdDev<T>(this IEnumerable<T> list, Func<T, double> values)
    {
        // ref: https://stackoverflow.com/questions/2253874/linq-equivalent-for-standard-deviation
        // ref: http://warrenseen.com/blog/2006/03/13/how-to-calculate-standard-deviation/ 
        var mean = 0.0;
        var sum = 0.0;
        var stdDev = 0.0;
        var n = 0;
        foreach (var value in list.Select(values))
        {
            n++;
            var delta = value - mean;
            mean += delta / n;
            sum += delta * (value - mean);
        }
        if (1 < n)
            stdDev = Math.Sqrt(sum / (n - 1));

        return stdDev; 

    }
}

Solution 4:

var stddev = Math.Sqrt(data.Average(z=>z*z)-Math.Pow(data.Average(),2));

Solution 5:

Straight to the point (and C# > 6.0), Dynamis answer becomes this:

    public static double StdDev(this IEnumerable<double> values)
    {
        var count = values?.Count() ?? 0;
        if (count <= 1) return 0;

        var avg = values.Average();
        var sum = values.Sum(d => Math.Pow(d - avg, 2));

        return Math.Sqrt(sum / count);
    }

Edit 2020-08-27:

I took @David Clarke comments to make some performance tests and this are the results:

    public static (double stdDev, double avg) StdDevFast(this List<double> values)
    {
        var count = values?.Count ?? 0;
        if (count <= 1) return (0, 0);

        var avg = GetAverage(values);
        var sum = GetSumOfSquareDiff(values, avg);

        return (Math.Sqrt(sum / count), avg);
    }

    private static double GetAverage(List<double> values)
    {
        double sum = 0.0;
        for (int i = 0; i < values.Count; i++) 
            sum += values[i];
        
        return sum / values.Count;
    }
    private static double GetSumOfSquareDiff(List<double> values, double avg)
    {
        double sum = 0.0;
        for (int i = 0; i < values.Count; i++)
        {
            var diff = values[i] - avg;
            sum += diff * diff;
        }
        return sum;
    }

I tested this with a list of one million random doubles
the original implementation had an runtime of ~48ms
the performance optimized implementation 2-3ms
so this is an significant improvement.

Some interesting details:
getting rid of Math.Pow brings a boost of 33ms!
List instead of IEnumerable 6ms
manually Average calculation 4ms
For-loops instead of ForEach-loops 2ms
Array instead of List brings just an improvement of ~2% so i skipped this
using single instead of double brings nothing

Further lowering the code and using goto (yes GOTO... haven't used this since the 90s assembler...) instead of for-loops does not pay, Thank goodness!

I have tested also parallel calculation, this makes sense on list > 200.000 items It seems that Hardware and Software needs to initialize a lot and this is for small lists contra-productive.

All tests were executed two times in a row to get rid of the warmup-time.