Identify and count segments between a start and an end marker
Usually problems like these are solved by fiddling with cumsum
and shift
.
The main idea for this solution is to identify rows where the number of "starts" seen is ahead of the number of "ends" seen by one.
The only assumption I made is that 'start'
and 'end'
alternate, beginning with a 'start'
.
>>> values = df['flag'].eq('start').cumsum()
>>> where = values.sub(1).eq(df['flag'].eq('end').cumsum().shift(1).fillna(0))
>>> df['flag_periods'] = df['flag'].mask(where, values)
>>> df
flag flag_periods
0 None None
1 start 1
2 None 1
3 None 1
4 end 1
5 start 2
6 end 2
7 None None
8 start 3
9 None 3
10 end 3
11 None None
Visualization:
>>> df['values'] = df.eq('start').cumsum()
>>> df['end_cumsum'] = df['flag'].eq('end').cumsum()
>>> df['end_cumsum_s1'] = df['end_cumsum'].shift(1).fillna(0)
>>> df['values-1'] = df['values'].sub(1)
>>> df['where'] = df['values-1'].eq(df['end_cumsum_s1'])
>>> df
flag values end_cumsum end_cumsum_s1 values-1 where
0 None 0 0 0.0 -1 False
1 start 1 0 0.0 0 True
2 None 1 0 0.0 0 True
3 None 1 0 0.0 0 True
4 end 1 1 0.0 0 True
5 start 2 1 1.0 1 True
6 end 2 2 1.0 1 True
7 None 2 2 2.0 1 False
8 start 3 2 2.0 2 True
9 None 3 2 2.0 2 True
10 end 3 3 2.0 2 True
11 None 3 3 3.0 2 False
edit: added .fillna(0)
to account for dataframes where the first value in the 'flag'
column is 'start'
.