Standard Deviation in LINQ
Does LINQ model the aggregate SQL function STDDEV()
(standard deviation)?
If not, what is the simplest / best-practices way to calculate it?
Example:
SELECT test_id, AVERAGE(result) avg, STDDEV(result) std
FROM tests
GROUP BY test_id
Solution 1:
You can make your own extension calculating it
public static class Extensions
{
public static double StdDev(this IEnumerable<double> values)
{
double ret = 0;
int count = values.Count();
if (count > 1)
{
//Compute the Average
double avg = values.Average();
//Perform the Sum of (value-avg)^2
double sum = values.Sum(d => (d - avg) * (d - avg));
//Put it all together
ret = Math.Sqrt(sum / count);
}
return ret;
}
}
If you have a sample of the population rather than the whole population, then you should use ret = Math.Sqrt(sum / (count - 1));
.
Transformed into extension from Adding Standard Deviation to LINQ by Chris Bennett.
Solution 2:
Dynami's answer works but makes multiple passes through the data to get a result. This is a single pass method that calculates the sample standard deviation:
public static double StdDev(this IEnumerable<double> values)
{
// ref: http://warrenseen.com/blog/2006/03/13/how-to-calculate-standard-deviation/
double mean = 0.0;
double sum = 0.0;
double stdDev = 0.0;
int n = 0;
foreach (double val in values)
{
n++;
double delta = val - mean;
mean += delta / n;
sum += delta * (val - mean);
}
if (1 < n)
stdDev = Math.Sqrt(sum / (n - 1));
return stdDev;
}
This is the sample standard deviation since it divides by n - 1
. For the normal standard deviation you need to divide by n
instead.
This uses Welford's method which has higher numerical accuracy compared to the Average(x^2)-Average(x)^2
method.
Solution 3:
This converts David Clarke's answer into an extension that follows the same form as the other aggregate LINQ functions like Average.
Usage would be: var stdev = data.StdDev(o => o.number)
public static class Extensions
{
public static double StdDev<T>(this IEnumerable<T> list, Func<T, double> values)
{
// ref: https://stackoverflow.com/questions/2253874/linq-equivalent-for-standard-deviation
// ref: http://warrenseen.com/blog/2006/03/13/how-to-calculate-standard-deviation/
var mean = 0.0;
var sum = 0.0;
var stdDev = 0.0;
var n = 0;
foreach (var value in list.Select(values))
{
n++;
var delta = value - mean;
mean += delta / n;
sum += delta * (value - mean);
}
if (1 < n)
stdDev = Math.Sqrt(sum / (n - 1));
return stdDev;
}
}
Solution 4:
var stddev = Math.Sqrt(data.Average(z=>z*z)-Math.Pow(data.Average(),2));
Solution 5:
Straight to the point (and C# > 6.0), Dynamis answer becomes this:
public static double StdDev(this IEnumerable<double> values)
{
var count = values?.Count() ?? 0;
if (count <= 1) return 0;
var avg = values.Average();
var sum = values.Sum(d => Math.Pow(d - avg, 2));
return Math.Sqrt(sum / count);
}
Edit 2020-08-27:
I took @David Clarke comments to make some performance tests and this are the results:
public static (double stdDev, double avg) StdDevFast(this List<double> values)
{
var count = values?.Count ?? 0;
if (count <= 1) return (0, 0);
var avg = GetAverage(values);
var sum = GetSumOfSquareDiff(values, avg);
return (Math.Sqrt(sum / count), avg);
}
private static double GetAverage(List<double> values)
{
double sum = 0.0;
for (int i = 0; i < values.Count; i++)
sum += values[i];
return sum / values.Count;
}
private static double GetSumOfSquareDiff(List<double> values, double avg)
{
double sum = 0.0;
for (int i = 0; i < values.Count; i++)
{
var diff = values[i] - avg;
sum += diff * diff;
}
return sum;
}
I tested this with a list of one million random doubles
the original implementation had an runtime of ~48ms
the performance optimized implementation 2-3ms
so this is an significant improvement.
Some interesting details:
getting rid of Math.Pow brings a boost of 33ms!
List instead of IEnumerable 6ms
manually Average calculation 4ms
For-loops instead of ForEach-loops 2ms
Array instead of List brings just an improvement of ~2% so i skipped this
using single instead of double brings nothing
Further lowering the code and using goto (yes GOTO... haven't used this since the 90s assembler...) instead of for-loops
does not pay, Thank goodness!
I have tested also parallel calculation, this makes sense on list > 200.000 items It seems that Hardware and Software needs to initialize a lot and this is for small lists contra-productive.
All tests were executed two times in a row to get rid of the warmup-time.