Finding top n sub-groups under a main group based on values of a numeric column in Python
Solution 1:
Sort by Sales
in descending order and then use groupby
+ head(2)
to get the top 2 values:
(df.sort_values('Sales', ascending=False)
.groupby('Category')
.head(2)
.sort_values('Category'))
Category Brand Sales
0 Furniture F1 26
6 Furniture F7 24
7 Supplies S4 17
4 Supplies S7 14
8 Tech T4 28
5 Tech T2 23
Another option is to use nlargest
in each group:
df.groupby('Category', group_keys=False).apply(lambda g: g.nlargest(2, 'Sales'))
Category Brand Sales
0 Furniture F1 26
6 Furniture F7 24
7 Supplies S4 17
4 Supplies S7 14
8 Tech T4 28
5 Tech T2 23
Solution 2:
You can get the index of the largest elements using something like
idx = df.groupby('Category')['Sales'].nlargest(2).index
Now you can index the original dataframe with the second level of the index, which contains the original row indices:
df.loc[idx.get_level_values(1)]
The nice thing about this approach is that it is O(n)
rather than O(n log n)
, since it uses partitioning instead of sorting, which will matter for large datasets.
Of course you can write it all in one line:
df.loc[df.groupby('Category')['Sales'].nlargest(2).index.get_level_values(1)]