Pandas sort by group aggregate and column
Given the following dataframe
In [31]: rand = np.random.RandomState(1)
df = pd.DataFrame({'A': ['foo', 'bar', 'baz'] * 2,
'B': rand.randn(6),
'C': rand.rand(6) > .5})
In [32]: df
Out[32]: A B C
0 foo 1.624345 False
1 bar -0.611756 True
2 baz -0.528172 False
3 foo -1.072969 True
4 bar 0.865408 False
5 baz -2.301539 True
I would like to sort it in groups (A
) by the aggregated sum of B
, and then by the value in C
(not aggregated). So basically get the order of the A
groups with
In [28]: df.groupby('A').sum().sort('B')
Out[28]: B C
A
baz -2.829710 1
bar 0.253651 1
foo 0.551377 1
And then by True/False, so that it ultimately looks like this:
In [30]: df.ix[[5, 2, 1, 4, 3, 0]]
Out[30]: A B C
5 baz -2.301539 True
2 baz -0.528172 False
1 bar -0.611756 True
4 bar 0.865408 False
3 foo -1.072969 True
0 foo 1.624345 False
How can this be done?
Groupby A:
In [0]: grp = df.groupby('A')
Within each group, sum over B and broadcast the values using transform. Then sort by B:
In [1]: grp[['B']].transform(sum).sort('B')
Out[1]:
B
2 -2.829710
5 -2.829710
1 0.253651
4 0.253651
0 0.551377
3 0.551377
Index the original df by passing the index from above. This will re-order the A values by the aggregate sum of the B values:
In [2]: sort1 = df.ix[grp[['B']].transform(sum).sort('B').index]
In [3]: sort1
Out[3]:
A B C
2 baz -0.528172 False
5 baz -2.301539 True
1 bar -0.611756 True
4 bar 0.865408 False
0 foo 1.624345 False
3 foo -1.072969 True
Finally, sort the 'C' values within groups of 'A' using the sort=False
option to preserve the A sort order from step 1:
In [4]: f = lambda x: x.sort('C', ascending=False)
In [5]: sort2 = sort1.groupby('A', sort=False).apply(f)
In [6]: sort2
Out[6]:
A B C
A
baz 5 baz -2.301539 True
2 baz -0.528172 False
bar 1 bar -0.611756 True
4 bar 0.865408 False
foo 3 foo -1.072969 True
0 foo 1.624345 False
Clean up the df index by using reset_index
with drop=True
:
In [7]: sort2.reset_index(0, drop=True)
Out[7]:
A B C
5 baz -2.301539 True
2 baz -0.528172 False
1 bar -0.611756 True
4 bar 0.865408 False
3 foo -1.072969 True
0 foo 1.624345 False
Here's a more concise approach...
df['a_bsum'] = df.groupby('A')['B'].transform(sum)
df.sort(['a_bsum','C'], ascending=[True, False]).drop('a_bsum', axis=1)
The first line adds a column to the data frame with the groupwise sum. The second line performs the sort and then removes the extra column.
Result:
A B C
5 baz -2.301539 True
2 baz -0.528172 False
1 bar -0.611756 True
4 bar 0.865408 False
3 foo -1.072969 True
0 foo 1.624345 False
NOTE: sort
is deprecated, use sort_values
instead
One way to do this is to insert a dummy column with the sums in order to sort:
In [10]: sum_B_over_A = df.groupby('A').sum().B
In [11]: sum_B_over_A
Out[11]:
A
bar 0.253652
baz -2.829711
foo 0.551376
Name: B
in [12]: df['sum_B_over_A'] = df.A.apply(sum_B_over_A.get_value)
In [13]: df
Out[13]:
A B C sum_B_over_A
0 foo 1.624345 False 0.551376
1 bar -0.611756 True 0.253652
2 baz -0.528172 False -2.829711
3 foo -1.072969 True 0.551376
4 bar 0.865408 False 0.253652
5 baz -2.301539 True -2.829711
In [14]: df.sort(['sum_B_over_A', 'A', 'B'])
Out[14]:
A B C sum_B_over_A
5 baz -2.301539 True -2.829711
2 baz -0.528172 False -2.829711
1 bar -0.611756 True 0.253652
4 bar 0.865408 False 0.253652
3 foo -1.072969 True 0.551376
0 foo 1.624345 False 0.551376
and maybe you would drop the dummy row:
In [15]: df.sort(['sum_B_over_A', 'A', 'B']).drop('sum_B_over_A', axis=1)
Out[15]:
A B C
5 baz -2.301539 True
2 baz -0.528172 False
1 bar -0.611756 True
4 bar 0.865408 False
3 foo -1.072969 True
0 foo 1.624345 False