Is there a fast (and accurate) way to calculate the running sample variance of a data set, i.e. the variance of the elements up to the n-th one?
A simple way is to use pandas with expanding() and var(ddof=0):
import numpy as np
import pandas as pd
x = np.array([5, 2, 2, 5, 3, 5, 2, 5, 4, 2])
pd.Series(x).expanding().var(ddof=0).to_numpy()
output:
array([0. , 2.25 , 2. , 2.25 , 1.84 ,
1.88888889, 1.95918367, 1.984375 , 1.77777778, 1.85 ])
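As a quick sanity check (reusing the same x), each entry of the expanding variance equals the population variance of the corresponding prefix computed directly with np.var:

```python
import numpy as np
import pandas as pd

x = np.array([5, 2, 2, 5, 3, 5, 2, 5, 4, 2])
expanding_var = pd.Series(x).expanding().var(ddof=0).to_numpy()

# np.var defaults to ddof=0, so each prefix gives the same value
prefix_var = np.array([np.var(x[:i + 1]) for i in range(len(x))])
print(np.allclose(expanding_var, prefix_var))
```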
I would actually also go with the pandas approach. However, one possible solution in pure numpy that doesn't require slicing would be:
def running_mean(x: np.ndarray) -> np.ndarray:
    # cumulative sum divided by the number of elements seen so far
    return np.cumsum(x) / np.arange(1, len(x) + 1)

def running_var(x: np.ndarray) -> np.ndarray:
    means = running_mean(x)
    # np.tril(x) broadcasts x into row i as x[:i+1] (zeros after),
    # np.triu(means).T places means[i] in the first i+1 columns of row i,
    # so row i sums (x[j] - means[i])**2 over j <= i
    return ((np.tril(x) - np.triu(means).T) ** 2).sum(axis=1) / np.arange(1, len(x) + 1)
So basically it reuses the running mean function, puts the data into triangular matrices, and does the math from there. This could become slow for large x, though, since it creates those N x N triangular matrices.
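If that N x N memory cost is a concern, here is a sketch of an O(n) pure-numpy alternative (the function name is my own) based on the identity Var(x) = E[x²] − E[x]². Be aware that this textbook formula can lose precision when the mean is large relative to the spread; Welford's algorithm is the numerically stable option in that case:

```python
import numpy as np

def running_var_cumsum(x: np.ndarray) -> np.ndarray:
    """Running population variance (ddof=0) in O(n) time and memory.

    Uses Var(x) = E[x^2] - E[x]^2 via two cumulative sums. Fast,
    but can be numerically unstable for large means with small
    variance; prefer Welford's algorithm if that matters.
    """
    n = np.arange(1, len(x) + 1)
    mean = np.cumsum(x) / n
    # cast to float before squaring to avoid integer overflow
    mean_of_squares = np.cumsum(x.astype(float) ** 2) / n
    return mean_of_squares - mean ** 2

x = np.array([5, 2, 2, 5, 3, 5, 2, 5, 4, 2])
print(running_var_cumsum(x))
```

On this input it reproduces the pandas expanding().var(ddof=0) output, starting at 0.0 for the single-element prefix.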