Only keep the IDs based on the observations of the other variable

Current table

ID   date   sender   sum_sender 
A    Jan20     3         37
A    Feb20     7         37
A    Mar20     12        37
A    Apr20     15        37
B    Mar20     1         26
B    May20     10        26
B    Jun20     15        26
...
Y    Jan21     10        47
Y    Feb21     12        47
Y    Mar21     20        47
Y    Apr21     5         47

I have a panel-time series with many IDs. How do I only keep rows of observations with 10 highest values of sum_sender?

so if i want to keep the observations with 2 highest sum_sender values

desired table

ID   date   sender   sum_sender 
A    Jan20     3         37
A    Feb20     7         37
A    Mar20     12        37
A    Apr20     15        37
Y    Jan21     10        47
Y    Feb21     12        47
Y    Mar21     20        47
Y    Apr21     5         47

Use nlargest:

N = 10
out = df.loc[df.groupby('ID')['sum_sender'].nlargest(N).index.levels[1]]

Example for N=2 with your sample:

>>> df.loc[df.groupby('ID')['sum_sender'].nlargest(N).index.levels[1]]
  ID   date  sender  sum_sender
0  A  Jan20       3          37
1  A  Feb20       7          37
4  B  Mar20       1          26
5  B  May20      10          26
7  Y  Jan21      10          47
8  Y  Feb21      12          47

Update

If you need the top 10 of sum_sender independently of ID, you can simple use:

>>> df.nlargest(columns='sum_sender', n=10)
   ID   date  sender  sum_sender
7   Y  Jan21      10          47
8   Y  Feb21      12          47
9   Y  Mar21      20          47
10  Y  Apr21       5          47
0   A  Jan20       3          37
1   A  Feb20       7          37
2   A  Mar20      12          37
3   A  Apr20      15          37
4   B  Mar20       1          26
5   B  May20      10          26

Update 2 Try:

>>> df.loc[df['ID'].isin(df.groupby('ID').max().nlargest(2, 'sum_sender').index)]
   ID   date  sender  sum_sender
0   A  Jan20       3          37
1   A  Feb20       7          37
2   A  Mar20      12          37
3   A  Apr20      15          37
7   Y  Jan21      10          47
8   Y  Feb21      12          47
9   Y  Mar21      20          47
10  Y  Apr21       5          47

drop_duplicates in "sum_sender", then find the 2 largest values by nlargest, then use isin to filter:

largest_values = df['sum_sender'].drop_duplicates().nlargest(2)
out = df[df['sum_sender'].isin(largest_values)]

Output:

   ID   date  sender  sum_sender
0   A  Jan20       3          37
1   A  Feb20       7          37
2   A  Mar20      12          37
3   A  Apr20      15          37
7   Y  Jan21      10          47
8   Y  Feb21      12          47
9   Y  Mar21      20          47
10  Y  Apr21       5          47

Only keep the IDs based on the observations of the other variable

Related

Recent Posts