Calculate average of every x rows in a table and create new table
You can create an artificial grouping key with df.index // 2 (or, as @DSM pointed out, with np.arange(len(df)) // 2 so that it works for any index) and then use groupby:
df.groupby(np.arange(len(df))//2).mean()
Out[13]:
      a     b     c     d
0   3.0  30.5  31.5  35.0
1   7.0  35.0  21.5  25.0
2  11.0  37.5  41.5  38.5
3  15.0  10.0  16.0  18.5
4  19.0  15.5  27.0  38.0
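The same trick generalizes to averaging every n rows for any group size; a minimal sketch (the frame and the choice of n here are just illustrative):

```python
import numpy as np
import pandas as pd

# Small example frame; the values are arbitrary.
df = pd.DataFrame({"a": range(10), "b": range(10, 20)})

n = 2  # group size; any positive integer works
out = df.groupby(np.arange(len(df)) // n).mean()

# With n = 2, row 0 of `out` is the mean of rows 0-1, row 1 of rows 2-3, etc.
# If len(df) is not divisible by n, the last group simply averages the
# leftover rows instead of raising an error.
```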
You can also approach this using DataFrame.rolling() to compute a rolling average and then grab every second row with iloc:
df = df.rolling(2).mean()
df = df.iloc[::2, :]
Note that the first row will be NaN (a window of size 2 is only complete from the second row onward), and this slice pairs rows (1, 2), (3, 4), and so on; to average rows (0, 1), (2, 3), … instead, use df.iloc[1::2, :]. Either way, make sure your data is sorted the way you need it first.
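For instance, on a small four-row frame (data arbitrary), slicing from index 1 keeps one averaged row per non-overlapping pair, counted from the top:

```python
import pandas as pd

df = pd.DataFrame({"a": [2, 4, 6, 8], "b": [50, 11, 33, 37]})

# Rolling mean over a window of 2, then keep every second row starting at
# index 1, so each kept row averages a non-overlapping pair of rows:
# row 0 of `out` <- rows 0-1 of df, row 1 of `out` <- rows 2-3 of df.
out = df.rolling(2).mean().iloc[1::2].reset_index(drop=True)
```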
A NumPythonic way would be to extract the values as a NumPy array with df.values, reshape to a 3D array with 2 elements along axis=1 (and df.shape[1], here 4, along axis=2), perform the average reduction along axis=1, and finally convert back to a DataFrame, like so -
pd.DataFrame(df.values.reshape(-1,2,df.shape[1]).mean(1))
As it turns out, you can bring in NumPy's very efficient np.einsum to do this average-reduction as a combination of a sum-reduction and a scaling-down, like so -
pd.DataFrame(np.einsum('ijk->ik',df.values.reshape(-1,2,df.shape[1]))/2.0)
Please note that these reshape-based approaches assume that the number of rows is divisible by 2.
Also, as noted by @DSM, to preserve the column names you need to add columns=df.columns when converting back to a DataFrame, i.e. -
pd.DataFrame(...,columns=df.columns)
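Putting those two caveats together, here is a sketch that preserves the column names and trims any leftover rows when the row count is not divisible by the group size (the frame and n are illustrative):

```python
import numpy as np
import pandas as pd

# Example frame with named columns; values are arbitrary.
df = pd.DataFrame(np.arange(20).reshape(10, 2), columns=["a", "b"])

n = 2
m = (len(df) // n) * n  # trim leftover rows so the reshape is valid
out = pd.DataFrame(
    df.values[:m].reshape(-1, n, df.shape[1]).mean(axis=1),
    columns=df.columns,  # preserve the original column names
)
```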
Sample run -
>>> df
0 1 2 3
0 2 50 25 26
1 4 11 38 44
2 6 33 16 25
3 8 37 27 25
4 10 28 48 32
5 12 47 35 45
6 14 8 16 7
7 16 12 16 30
8 18 22 39 29
9 20 9 15 47
>>> pd.DataFrame(df.values.reshape(-1,2,df.shape[1]).mean(1))
0 1 2 3
0 3 30.5 31.5 35.0
1 7 35.0 21.5 25.0
2 11 37.5 41.5 38.5
3 15 10.0 16.0 18.5
4 19 15.5 27.0 38.0
>>> pd.DataFrame(np.einsum('ijk->ik',df.values.reshape(-1,2,df.shape[1]))/2.0)
0 1 2 3
0 3 30.5 31.5 35.0
1 7 35.0 21.5 25.0
2 11 37.5 41.5 38.5
3 15 10.0 16.0 18.5
4 19 15.5 27.0 38.0
Runtime tests -
In this section, let's time all three approaches listed thus far, including @ayhan's groupby solution.
In [24]: A = np.random.randint(0,9,(200,50))
In [25]: df = pd.DataFrame(A)
In [26]: %timeit df.groupby(df.index//2).mean() # @ayhan's solution
1000 loops, best of 3: 1.61 ms per loop
In [27]: %timeit pd.DataFrame(df.values.reshape(-1,2,df.shape[1]).mean(1))
1000 loops, best of 3: 317 µs per loop
In [28]: %timeit pd.DataFrame(np.einsum('ijk->ik',df.values.reshape(-1,2,df.shape[1]))/2.0)
1000 loops, best of 3: 266 µs per loop
df.set_index(np.arange(len(df)) // 2).mean(level=0)
Note that the level argument to mean() was deprecated in pandas 1.3 and removed in 2.0; in current pandas the equivalent is df.set_index(np.arange(len(df)) // 2).groupby(level=0).mean().