pandas.Grouper for time intervals behavior

When applying:

df.groupby(['id', pd.Grouper(key="timestamp", freq='5min', origin='start')])

The origin of the timestamp grouping is the first timestamp in the entire dataframe, not per group.

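The docs describe this option as: ‘start’: origin is the first value of the timeseries (https://pandas.pydata.org/docs/reference/api/pandas.Grouper.html). Here "the timeseries" is the whole timestamp column, not the timestamps of each id.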

Looking at df.groupby(['id', pd.Grouper(key="timestamp", freq='5min', origin='start')]).size(), you can see that all groups sit at 5-minute intervals (or multiples of 5-minute intervals) from the same starting point, even groups that belong to different ids:

id  timestamp              
1   2021-10-24 17:56:03.641     3
2   2021-10-24 19:31:03.641    10
    2021-10-24 19:36:03.641     9
...

6   2021-10-24 18:01:03.641     7
    2021-10-24 18:06:03.641    13
    ...

If you look at id 6, its first group is actually at an earlier timestamp than its first event. This happens for the same reason: the "buckets" for all users are based on 5-minute intervals from the first timestamp of the entire dataset. All rows before 18:06:03.641 are grouped in the 18:01:03.641 "bucket" and the rows after it are grouped in the 18:06:03.641 "bucket".

The first row of the dataset is the earliest, so when you remove the first user the bug is no longer visible.
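The shared origin can be reproduced on a small frame. The sketch below uses made-up toy data (the ids, timestamps, and idx values are illustrative assumptions, not the original dataset) and compares the bucket labels from the combined groupby with the labels obtained when a single id is grouped on its own.

```python
import pandas as pd

# Toy data, assumed for illustration only
df = pd.DataFrame({
    "id": [1, 2, 2, 2],
    "timestamp": pd.to_datetime([
        "2021-10-24 17:56:03",
        "2021-10-24 18:02:00",
        "2021-10-24 18:06:30",
        "2021-10-24 18:08:00",
    ]),
    "idx": [0, 1, 2, 3],
})

# Combined groupby: buckets for every id are anchored at the first
# timestamp of the whole frame (17:56:03)
shared = df.groupby(
    ["id", pd.Grouper(key="timestamp", freq="5min", origin="start")]
).size()
shared_labels = (
    shared.index.get_level_values("timestamp").strftime("%H:%M:%S").tolist()
)

# Single id: buckets are anchored at that id's own first timestamp (18:02:00)
own = (
    df[df["id"] == 2]
    .groupby(pd.Grouper(key="timestamp", freq="5min", origin="start"))
    .size()
)
own_labels = own.index.strftime("%H:%M:%S").tolist()

print(shared_labels)  # ['17:56:03', '18:01:03', '18:06:03']
print(own_labels)     # ['18:02:00', '18:07:00']
```

In the combined result, id 2 gets a bucket labelled 18:01:03, which is earlier than its first event at 18:02:00, the same effect seen for id 6 above.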

I think you can get the functionality you're looking for by first grouping by id and then applying an additional group-with-grouper using apply:

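import pandas as pd

def split_to_five_minute_groups(x):
    # x holds the rows of a single id, so origin='start' is that id's first timestamp
    return x.groupby(pd.Grouper(key="timestamp", freq='5min', origin='start'))[['idx']].transform('first')

# group_keys=False keeps the original row index, so the result aligns with df
df['first_idx'] = df.groupby('id', group_keys=False).apply(split_to_five_minute_groups)['idx']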

I think this is because the Grouper origin looks at the first timestamp in the entire series, not at the first timestamp of each grouped id.

This seems to work:

def tgs(df):
    res_list = []
    for _, df_s in df.groupby('id'):
        # Copy the slice so adding a column does not trigger SettingWithCopyWarning
        df_s = df_s.copy()
        # Within one id, origin='start' is that id's first timestamp
        g = df_s.groupby(pd.Grouper(key="timestamp", freq='5min', origin='start'))
        df_s['first_index'] = g['idx'].transform('first')
        res_list.append(df_s)
    return pd.concat(res_list)
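A possible alternative that avoids looping over ids is to compute the bucket number arithmetically. This is not the approach from the answers above, only a sketch of the same per-id bucketing without pd.Grouper. It assumes the timestamp column has no missing values and that rows are sorted by timestamp within each id; the toy data and the helper names start and bucket are made up for illustration.

```python
import pandas as pd

# Toy data, assumed for illustration only
df = pd.DataFrame({
    "id": [1, 2, 2, 2],
    "timestamp": pd.to_datetime([
        "2021-10-24 17:56:03",
        "2021-10-24 18:02:00",
        "2021-10-24 18:06:30",
        "2021-10-24 18:08:00",
    ]),
    "idx": [0, 1, 2, 3],
})

# First timestamp of each id, broadcast back to every row of that id
start = df.groupby("id")["timestamp"].transform("min")

# Number of whole 5-minute steps between each row and its id's first timestamp
bucket = ((df["timestamp"] - start) // pd.Timedelta("5min")).rename("bucket")

# First idx within each (id, bucket) pair, taken in row order
df["first_index"] = df.groupby(["id", bucket])["idx"].transform("first")

print(df["first_index"].tolist())  # [0, 1, 1, 3]
```

For id 2 the rows at 18:02:00 and 18:06:30 share bucket 0 and the row at 18:08:00 starts bucket 1, because the 5-minute steps are counted from 18:02:00 rather than from the first timestamp of the whole frame.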
