pandas.Grouper for time intervals behavior
When applying:
df.groupby(['id', pd.Grouper(key="timestamp", freq='5min', origin='start')])
the origin of the timestamp grouping is the first timestamp in the entire dataframe, not the first timestamp of each group.
According to the docs, 'start' means that origin is the first value of the timeseries:
https://pandas.pydata.org/docs/reference/api/pandas.Grouper.html
Looking at df.groupby(['id', pd.Grouper(key="timestamp", freq='5min', origin='start')]).size(), you can see that every bucket starts a whole multiple of 5 minutes after the same timestamp, even for buckets that belong to different ids:
id  timestamp
1   2021-10-24 17:56:03.641     3
2   2021-10-24 19:31:03.641    10
    2021-10-24 19:36:03.641     9
...
6   2021-10-24 18:01:03.641     7
    2021-10-24 18:06:03.641    13
...
If you look at id 6, its first group starts at an earlier timestamp than its first event. The cause is the same: the "buckets" for all ids are 5-minute intervals counted from the first timestamp of the entire dataset. For id 6, all rows before 18:06:03.641 are grouped into the 18:01:03.641 "bucket", and all rows from 18:06:03.641 up to the next 5-minute boundary are grouped into the 18:06:03.641 "bucket".
The first row of the dataset is the earliest, so when you remove the first user the bug is no longer visible.
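To make this concrete, here is a small sketch on made-up data (the column names id, timestamp and idx follow the question; the values are invented for illustration). It should show that the first bucket label for id 2 is derived from id 1's first timestamp, not from id 2's own first event:

```python
import pandas as pd

# Toy data, invented for illustration: id 1 owns the earliest timestamp.
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 2],
    "timestamp": pd.to_datetime([
        "2021-10-24 17:56:03", "2021-10-24 17:57:00",
        "2021-10-24 18:03:00", "2021-10-24 18:07:00", "2021-10-24 18:09:00",
    ]),
    "idx": [0, 1, 2, 3, 4],
})

# One Grouper over the whole frame: origin='start' is taken from the
# whole timestamp column, i.e. 17:56:03, which belongs to id 1.
sizes = df.groupby(
    ["id", pd.Grouper(key="timestamp", freq="5min", origin="start")]
).size()

first_event_id2 = df.loc[df["id"] == 2, "timestamp"].min()
first_bucket_id2 = sizes.loc[2].index[0]

print(sizes)
print(first_bucket_id2, "<", first_event_id2)
```

The bucket edges are 17:56:03, 18:01:03, 18:06:03 and so on, so id 2's event at 18:03:00 lands in a bucket labelled 18:01:03.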
I think you can get the functionality you're looking for by first grouping by id and then applying an additional group-with-grouper using apply:
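def split_to_five_minute_groups(x):
    # x holds the rows of a single id, so origin='start' is that id's first timestamp
    return x.groupby(pd.Grouper(key="timestamp", freq='5min', origin='start'))['idx'].transform('first')

# group_keys=False keeps the original row index so the result aligns with df
df['first_idx'] = df.groupby('id', group_keys=False).apply(split_to_five_minute_groups)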
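As a rough check of the per-id idea, the sketch below runs it on invented toy data and compares it with a single Grouper over the whole frame. Passing group_keys=False and selecting ['idx'] as a Series are my own assumptions about what keeps the result aligned with the original rows; the data values are made up.

```python
import pandas as pd

# Toy data, invented for illustration.
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 2],
    "timestamp": pd.to_datetime([
        "2021-10-24 17:56:03", "2021-10-24 17:57:00",
        "2021-10-24 18:03:00", "2021-10-24 18:07:00", "2021-10-24 18:09:00",
    ]),
    "idx": [0, 1, 2, 3, 4],
})

def five_minute_grouper():
    # Fresh Grouper per call; origin='start' uses the data it is applied to.
    return pd.Grouper(key="timestamp", freq="5min", origin="start")

def first_idx_per_bucket(x):
    # x contains the rows of one id only, so the buckets start at that
    # id's own first timestamp.
    return x.groupby(five_minute_grouper())["idx"].transform("first")

# Per-id origin: group by id first, then bucket inside each group.
per_id = df.groupby("id", group_keys=False).apply(first_idx_per_bucket)

# Global origin, for comparison: one Grouper over the whole frame.
global_origin = df.groupby(["id", five_minute_grouper()])["idx"].transform("first")

df["first_idx"] = per_id
df["first_idx_global"] = global_origin
print(df)
```

With the per-id origin, id 2's rows at 18:03:00 and 18:07:00 share a bucket that starts at 18:03:00; with the global origin they are split at 18:06:03.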
I think this is because the Grouper origin looks at the first timestamp in the entire series, not at the first timestamp of each grouped id.
This seems to work:
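def tgs(df):
    # One sub-frame per id; copy so that adding a column does not write into a slice of df
    df_list = [g.copy() for _, g in df.groupby('id')]
    res_list = []
    for df_s in df_list:
        # origin='start' now refers to the first timestamp of this id only
        g = df_s.groupby([pd.Grouper(key="timestamp", freq='5min', origin='start')])
        df_s['first_index'] = g['idx'].transform('first')
        res_list.append(df_s)
    return pd.concat(res_list)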
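If the per-id loop is too slow, one possible alternative (not from the answers above, just a sketch) is to skip pd.Grouper and compute the bucket number with plain timedelta arithmetic. It assumes the same id, timestamp and idx columns, that rows are already sorted by timestamp within each id, and it uses invented toy data; the bucket column name is hypothetical.

```python
import pandas as pd

# Toy data, invented for illustration; sorted by timestamp within each id.
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 2],
    "timestamp": pd.to_datetime([
        "2021-10-24 17:56:03", "2021-10-24 17:57:00",
        "2021-10-24 18:03:00", "2021-10-24 18:07:00", "2021-10-24 18:09:00",
    ]),
    "idx": [0, 1, 2, 3, 4],
})

# Earliest timestamp of each id, broadcast back to every row of that id.
start = df.groupby("id")["timestamp"].transform("min")

# Integer bucket number within the id: 0 for the first 5 minutes after the
# id's own start, 1 for the next 5 minutes, and so on.
df["bucket"] = (df["timestamp"] - start) // pd.Timedelta("5min")

# First idx (in row order) inside each (id, bucket) pair.
df["first_index"] = df.groupby(["id", "bucket"])["idx"].transform("first")
print(df)
```

Unlike the Grouper version, this takes the first row in row order rather than in time order, which is why the sorting assumption matters.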