Only keep the IDs based on the observations of the other variable
Current table
ID date sender sum_sender
A Jan20 3 37
A Feb20 7 37
A Mar20 12 37
A Apr20 15 37
B Mar20 1 26
B May20 10 26
B Jun20 15 26
...
Y Jan21 10 47
Y Feb21 12 47
Y Mar21 20 47
Y Apr21 5 47
I have a panel time series with many IDs. How do I keep only the rows of observations with the 10 highest values of sum_sender?
For example, if I want to keep the observations with the 2 highest sum_sender values:
Desired table
ID date sender sum_sender
A Jan20 3 37
A Feb20 7 37
A Mar20 12 37
A Apr20 15 37
Y Jan21 10 47
Y Feb21 12 47
Y Mar21 20 47
Y Apr21 5 47
Use nlargest:
N = 10
out = df.loc[df.groupby('ID')['sum_sender'].nlargest(N).index.levels[1]]
Example for N=2 with your sample:
>>> df.loc[df.groupby('ID')['sum_sender'].nlargest(N).index.levels[1]]
ID date sender sum_sender
0 A Jan20 3 37
1 A Feb20 7 37
4 B Mar20 1 26
5 B May20 10 26
7 Y Jan21 10 47
8 Y Feb21 12 47
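The output above still contains ID B, because the grouped nlargest picks the top N rows within each ID rather than the top N IDs. The following is a minimal sketch that reproduces this. It assumes a frame built only from the rows shown in the question (the rows elided by "..." are left out), and it reads the row labels with get_level_values(1) instead of levels[1], which is a substitution on my part.

```python
import pandas as pd

# Sample rows copied from the question; rows hidden behind "..." are omitted.
df = pd.DataFrame({
    "ID": ["A"] * 4 + ["B"] * 3 + ["Y"] * 4,
    "date": ["Jan20", "Feb20", "Mar20", "Apr20",
             "Mar20", "May20", "Jun20",
             "Jan21", "Feb21", "Mar21", "Apr21"],
    "sender": [3, 7, 12, 15, 1, 10, 15, 10, 12, 20, 5],
    "sum_sender": [37] * 4 + [26] * 3 + [47] * 4,
})

N = 2
# Per-ID nlargest returns a Series indexed by (ID, original row label).
top_per_id = df.groupby("ID")["sum_sender"].nlargest(N)
# Level 1 of that MultiIndex holds the original row labels.
rows = top_per_id.index.get_level_values(1)
out = df.loc[rows]
print(out)
```

Every ID keeps up to N rows here, so this answers "top N rows per ID", not "top N IDs".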
Update
If you need the top 10 rows by sum_sender, independently of ID, you can simply use:
>>> df.nlargest(columns='sum_sender', n=10)
ID date sender sum_sender
7 Y Jan21 10 47
8 Y Feb21 12 47
9 Y Mar21 20 47
10 Y Apr21 5 47
0 A Jan20 3 37
1 A Feb20 7 37
2 A Mar20 12 37
3 A Apr20 15 37
4 B Mar20 1 26
5 B May20 10 26
Update 2
Try:
>>> df.loc[df['ID'].isin(df.groupby('ID').max().nlargest(2, 'sum_sender').index)]
ID date sender sum_sender
0 A Jan20 3 37
1 A Feb20 7 37
2 A Mar20 12 37
3 A Apr20 15 37
7 Y Jan21 10 47
8 Y Feb21 12 47
9 Y Mar21 20 47
10 Y Apr21 5 47
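The same idea can be wrapped in a small function. This is only a sketch: keep_top_ids is a hypothetical helper name, the sample frame contains just the rows shown in the question, and the sum_sender column is selected before calling max() so that the other columns are not aggregated.

```python
import pandas as pd

def keep_top_ids(df: pd.DataFrame, n: int) -> pd.DataFrame:
    """Keep all rows of the n IDs with the largest sum_sender (hypothetical helper)."""
    # One value per ID; max() is used because sum_sender is constant within an ID.
    per_id = df.groupby("ID")["sum_sender"].max()
    # Index of the n largest per-ID values, i.e. the IDs to keep.
    top_ids = per_id.nlargest(n).index
    return df[df["ID"].isin(top_ids)]

# Sample rows copied from the question; rows hidden behind "..." are omitted.
df = pd.DataFrame({
    "ID": ["A"] * 4 + ["B"] * 3 + ["Y"] * 4,
    "date": ["Jan20", "Feb20", "Mar20", "Apr20",
             "Mar20", "May20", "Jun20",
             "Jan21", "Feb21", "Mar21", "Apr21"],
    "sender": [3, 7, 12, 15, 1, 10, 15, 10, 12, 20, 5],
    "sum_sender": [37] * 4 + [26] * 3 + [47] * 4,
})

out = keep_top_ids(df, 2)
print(out)
```

For the real data, calling keep_top_ids(df, 10) should keep the rows of the 10 IDs with the highest sum_sender.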
Use drop_duplicates on "sum_sender", then find the 2 largest values with nlargest, then use isin to filter:
largest_values = df['sum_sender'].drop_duplicates().nlargest(2)
out = df[df['sum_sender'].isin(largest_values)]
Output:
ID date sender sum_sender
0 A Jan20 3 37
1 A Feb20 7 37
2 A Mar20 12 37
3 A Apr20 15 37
7 Y Jan21 10 47
8 Y Feb21 12 47
9 Y Mar21 20 47
10 Y Apr21 5 47
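One thing to keep in mind: filtering on values keeps every ID whose sum_sender is among the N largest distinct values, so it can return more than N IDs when two IDs share the same total. The sketch below compares it with an ID-based filter on made-up data, where IDs A and C are assumed to tie at 37.

```python
import pandas as pd

# Hypothetical data: IDs "A" and "C" share the same sum_sender of 37.
df = pd.DataFrame({
    "ID": ["A", "A", "B", "C", "C", "Y", "Y"],
    "sum_sender": [37, 37, 26, 37, 37, 47, 47],
})

# Value-based filter: rows whose sum_sender is one of the 2 largest distinct values.
largest_values = df["sum_sender"].drop_duplicates().nlargest(2)
by_value = df[df["sum_sender"].isin(largest_values)]

# ID-based filter: rows belonging to exactly 2 IDs.
top_ids = df.groupby("ID")["sum_sender"].max().nlargest(2).index
by_id = df[df["ID"].isin(top_ids)]

print(sorted(by_value["ID"].unique()))  # three IDs survive the value-based filter
print(sorted(by_id["ID"].unique()))     # only two IDs survive the ID-based filter
```

If sum_sender is unique per ID, both filters give the same rows.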