Conditional cumulative sum in Python/Pandas
Solution 1:
You want to take the cumulative sum of data_binary
and subtract the most recent cumulative sum where data_binary
was zero.
b = df.data_binary
c = b.cumsum()
c.sub(c.mask(b != 0).ffill(), fill_value=0).astype(int)
Output
0 1
1 0
2 1
3 2
4 3
5 0
6 0
7 1
Name: data_binary, dtype: int64
Explanation
Let's start by looking at each step side by side
cols = ['data_binary', 'cumulative_sum', 'nan_non_zero', 'forward_fill', 'final_result']
print(pd.concat([
b, c,
c.mask(b != 0),
c.mask(b != 0).ffill(),
c.sub(c.mask(b != 0).ffill(), fill_value=0).astype(int)
], axis=1, keys=cols))
Output
data_binary cumulative_sum nan_non_zero forward_fill final_result
0 1 1 NaN NaN 1
1 0 1 1.0 1.0 0
2 1 2 NaN 1.0 1
3 1 3 NaN 1.0 2
4 1 4 NaN 1.0 3
5 0 4 4.0 4.0 0
6 0 4 4.0 4.0 0
7 1 5 NaN 4.0 1
The problem with cumulative_sum
is that the rows where data_binary
is zero, do not reset the sum. And that is the motivation for this solution. How do we "reset" the sum when data_binary
is zero? Easy! I slice the cumulative sum where data_binary
is zero and forward fill the values. When I take the difference between this and the cumulative sum, I've effectively reset the sum.
Solution 2:
I think you can groupby
with DataFrameGroupBy.cumsum
by Series
, where first compare the next value by the shift
ed column if not equal (!=
) and then create groups by cumsum
. Last, replace 0
by column data_binary
with mask
:
print (df.data_binary.ne(df.data_binary.shift()).cumsum())
0 1
1 2
2 3
3 3
4 3
5 4
6 4
7 5
Name: data_binary, dtype: int32
df['sum_data1'] = df.data_binary.groupby(df.data_binary.ne(df.data_binary.shift()).cumsum())
.cumsum()
df['sum_data1'] = df['sum_data1'].mask(df.data_binary == 0, 0)
print (df)
data data_binary sum_data sum_data1
0 2 1 1 1
1 5 0 0 0
2 1 1 1 1
3 4 1 2 2
4 3 1 3 3
5 10 0 0 0
6 7 0 0 0
7 3 1 1 1