pandas.Grouper for time intervals behavior

When applying:

df.groupby(['id', pd.Grouper(key="timestamp", freq='5min', origin='start')])

The origin of the timestamp grouping is the first timestamp in the entire dataframe, not the first timestamp within each group.

According to the docs, 'start': origin is the first value of the timeseries (https://pandas.pydata.org/docs/reference/api/pandas.Grouper.html).
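A minimal synthetic example (made-up timestamps, not the original data) reproduces this: with origin='start', every id's buckets are anchored at the first timestamp of the whole column.

import pandas as pd

demo = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "timestamp": pd.to_datetime([
        "2021-10-24 17:56:00",  # first timestamp of the whole frame = the global origin
        "2021-10-24 17:58:00",
        "2021-10-24 18:03:00",  # id 2's first event
        "2021-10-24 18:07:00",
    ]),
})

print(demo.groupby(["id", pd.Grouper(key="timestamp", freq="5min", origin="start")]).size())
# id 2's first bucket is labelled 18:01:00 (origin + 5min), earlier than its first event at 18:03:00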

Looking at df.groupby(['id', pd.Grouper(key="timestamp", freq='5min', origin='start')]).size(), you can see that all group labels fall on 5-minute intervals (or multiples of 5 minutes) from the same origin, even for groups belonging to different ids:

id  timestamp              
1   2021-10-24 17:56:03.641     3
2   2021-10-24 19:31:03.641    10
    2021-10-24 19:36:03.641     9
...

6   2021-10-24 18:01:03.641     7
    2021-10-24 18:06:03.641    13
    ...

If you look at id 6, its first group actually starts at an earlier timestamp than its first event. This happens for the same reason: the "buckets" for all ids are built on 5-minute intervals counted from the first timestamp of the entire dataset. All rows before 18:06:03.641 are grouped into the 18:01:03.641 "bucket", and all rows after it are grouped into the 18:06:03.641 "bucket".
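To make the bucket arithmetic concrete (a small sketch using the origin from the output above and a hypothetical row time), a row's bucket label is simply the global origin plus a whole number of 5-minute steps:

import pandas as pd

origin = pd.Timestamp("2021-10-24 17:56:03.641")  # first timestamp of the entire dataset
ts = pd.Timestamp("2021-10-24 18:03:00")          # hypothetical row time for id 6
bucket = origin + ((ts - origin) // pd.Timedelta("5min")) * pd.Timedelta("5min")
print(bucket)  # 2021-10-24 18:01:03.641 -> the 18:01:03.641 "bucket", regardless of id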

The first row of the dataset is also the earliest timestamp, so when you remove the first user the origin shifts and the issue is no longer visible.

I think you can get the functionality you're looking for by first grouping by id and then applying a second, Grouper-based groupby inside apply:

def split_to_five_minute_groups(x):
    # x is the sub-frame for a single id: re-bucket it into 5-minute intervals
    # anchored at that id's own first timestamp, and take the first 'idx' per bucket
    return x.groupby(pd.Grouper(key="timestamp", freq='5min', origin='start'))['idx'].transform('first')

# group_keys=False keeps the original index, so the result aligns when assigned back
df['first_idx'] = df.groupby('id', group_keys=False).apply(split_to_five_minute_groups)
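As a quick sanity check (a sketch, assuming the same df and the 'idx' column used above), each distinct first_idx value within an id corresponds to one non-empty 5-minute bucket for that id:

# Number of non-empty 5-minute buckets per id, now anchored at each id's own first timestamp
print(df.groupby('id')['first_idx'].nunique())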

I think this is because the Grouper origin is based on the first timestamp in the entire series, not on the first timestamp per grouped id.

This seems to work:

def tgs(df):
    # One sub-frame per id, so each id gets its own 5-minute origin
    df_list = [g for _, g in df.groupby('id')]
    res_list = []
    for df_s in df_list:
        df_s = df_s.copy()  # avoid mutating a slice of the original frame
        g = df_s.groupby(pd.Grouper(key="timestamp", freq='5min', origin='start'))
        df_s['first_index'] = g['idx'].transform('first')
        res_list.append(df_s)
    return pd.concat(res_list)
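Usage would be along these lines (assuming df has the 'id', 'timestamp' and 'idx' columns used above):

df = tgs(df)
print(df[['id', 'timestamp', 'idx', 'first_index']].head())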