Finding top n sub-groups under a main group based on values of a numeric column in Python

Solution 1:

Sort by Sales in descending order and then use groupby + head(2) to get the top 2 values:

(df.sort_values('Sales', ascending=False)
   .groupby('Category')
   .head(2)
   .sort_values('Category'))

    Category Brand  Sales
0  Furniture    F1     26
6  Furniture    F7     24
7   Supplies    S4     17
4   Supplies    S7     14
8       Tech    T4     28
5       Tech    T2     23

Another option is to use nlargest in each group:

df.groupby('Category', group_keys=False).apply(lambda g: g.nlargest(2, 'Sales'))

    Category Brand  Sales
0  Furniture    F1     26
6  Furniture    F7     24
7   Supplies    S4     17
4   Supplies    S7     14
8       Tech    T4     28
5       Tech    T2     23

Solution 2:

You can get the index of the largest elements using something like

idx = df.groupby('Category')['Sales'].nlargest(2).index

Now you can index the original dataframe with the second level of the index, which contains the original row indices:

df.loc[idx.get_level_values(1)]

The nice thing about this approach is that it is O(n) rather than O(n log n), since it uses partitioning instead of sorting, which will matter for large datasets.

Of course you can write it all in one line:

df.loc[df.groupby('Category')['Sales'].nlargest(2).index.get_level_values(1)]

Finding top n sub-groups under a main group based on values of a numeric column in Python

Solution 1:

Solution 2:

Related

Recent Posts