Counting consecutive positive values in Python/pandas array
Solution 1:
>>> import pandas
>>> y = pandas.Series([0,0,1,1,1,0,0,1,0,1,1])
The following may seem a little magical, but it actually uses some common idioms: since pandas doesn't yet have nice native support for a contiguous groupby, you often find yourself needing something like this.
>>> y * (y.groupby((y != y.shift()).cumsum()).cumcount() + 1)
0 0
1 0
2 1
3 2
4 3
5 0
6 0
7 1
8 0
9 1
10 2
dtype: int64
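For intuition, the same "contiguous groupby" idea can be spelled out in plain Python with itertools.groupby, which groups consecutive equal values. This is a non-vectorized sketch, just to show what the pandas one-liner computes:

```python
from itertools import groupby

y = [0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1]
out = []
for val, run in groupby(y):   # each run is a maximal block of equal values
    n = len(list(run))
    # zeros stay zero; within a run of ones, count up from 1
    out.extend([0] * n if val == 0 else range(1, n + 1))
print(out)  # → [0, 0, 1, 2, 3, 0, 0, 1, 0, 1, 2]
```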
Some explanation: first, we compare y against a shifted version of itself to find where the contiguous groups begin. (The comparison at position 0 is True because y.shift() fills it with NaN, which compares unequal to everything.)
>>> y != y.shift()
0 True
1 False
2 True
3 False
4 False
5 True
6 False
7 True
8 True
9 True
10 False
dtype: bool
Then (since False == 0 and True == 1) we can apply a cumulative sum to get a number for the groups:
>>> (y != y.shift()).cumsum()
0 1
1 1
2 2
3 2
4 2
5 3
6 3
7 4
8 5
9 6
10 6
dtype: int32
We can use groupby and cumcount to get an integer counting up within each group:
>>> y.groupby((y != y.shift()).cumsum()).cumcount()
0 0
1 1
2 0
3 1
4 2
5 0
6 1
7 0
8 0
9 0
10 1
dtype: int64
Add one:
>>> y.groupby((y != y.shift()).cumsum()).cumcount() + 1
0 1
1 2
2 1
3 2
4 3
5 1
6 2
7 1
8 1
9 1
10 2
dtype: int64
And finally zero the values where we had zero to begin with:
>>> y * (y.groupby((y != y.shift()).cumsum()).cumcount() + 1)
0 0
1 0
2 1
3 2
4 3
5 0
6 0
7 1
8 0
9 1
10 2
dtype: int64
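Putting the pieces together, the whole pipeline can be wrapped in a small helper (count_consecutive is my own name for it, not a pandas function):

```python
import pandas as pd

def count_consecutive(y):
    # label each contiguous run: a new group starts wherever the value changes
    groups = (y != y.shift()).cumsum()
    # number elements within each run from 1, then zero out the original zeros
    return y * (y.groupby(groups).cumcount() + 1)

y = pd.Series([0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1])
print(count_consecutive(y).tolist())  # → [0, 0, 1, 2, 3, 0, 0, 1, 0, 1, 2]
```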
Solution 2:
If something is clear, it is "pythonic". Frankly, I cannot even make your original solution work. Also, if it does work, I am curious whether it is faster than a loop. Did you compare?
Now, since we've started discussing efficiency, here are some insights.
Loops in Python are inherently slow, no matter what you do. Of course, if you are using pandas, you are also using numpy underneath, with all its performance advantages; just don't destroy them by looping. Python lists also take far more memory than you may think: potentially much more than 8 bytes * length, since every integer may be wrapped in a separate object, placed in a separate area of memory, and pointed at from the list.
Vectorization as provided by numpy should be sufficient IF you can find some way to express this function without looping. In fact, I wonder if there is some way to represent it using expressions such as A+B*C. If you can construct this function out of functions in LAPACK, then you can even potentially beat ordinary C++ code compiled with optimization.
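For this particular function there is such a vectorized formulation: track the index of the most recent zero with np.maximum.accumulate and take each position's distance from it. A sketch (run_lengths is a hypothetical name, not a library function):

```python
import numpy as np

def run_lengths(a):
    a = np.asarray(a)
    idx = np.arange(len(a))
    # index of the most recent zero at or before each position (-1 if none yet)
    last_zero = np.maximum.accumulate(np.where(a == 0, idx, -1))
    # distance from the last zero is the running count; zeros stay zero
    return np.where(a == 0, 0, idx - last_zero)

print(run_lengths([0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1]).tolist())
# → [0, 0, 1, 2, 3, 0, 0, 1, 0, 1, 2]
```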
You can also use one of the compiled approaches to speed-up your loops. See a solution with Numba on numpy arrays below. Another option is to use PyPy, though you probably can't properly combine it with pandas.
In [140]: import pandas as pd
In [141]: import numpy as np
In [143]: a = np.random.randint(2, size=1000000)
In [144]: L = list(a)   # plain-list copy for the pure-Python version
# Try the simple approach
In [147]: def simple(L):
              for i in range(1, len(L)):   # start at 1 so L[i-1] doesn't wrap around
                  if L[i] == 1:
                      L[i] += L[i - 1]
In [148]: %time simple(L)
CPU times: user 255 ms, sys: 20.8 ms, total: 275 ms
Wall time: 248 ms
# Just-In-Time compilation
In [149]: from numba import jit
          @jit
          def faster(z):
              prev = 0
              for i in range(len(z)):
                  cur = z[i]
                  if cur == 0:
                      prev = 0
                  else:
                      prev = prev + cur
                  z[i] = prev
In [151]: %time faster(a)
CPU times: user 51.9 ms, sys: 1.12 ms, total: 53 ms
Wall time: 51.9 ms
In [159]: list(L)==list(a)
Out[159]: True
In fact, most of the time in the second example above was spent on Just-In-Time compilation. Time it again on fresh copies instead (remember to copy, as the function changes the array in place):
In [37]: b = a.copy(); c = a.copy()
In [38]: %time faster(b)
CPU times: user 55.1 ms, sys: 1.56 ms, total: 56.7 ms
Wall time: 56.3 ms
In [39]: %time faster(c)
CPU times: user 10.8 ms, sys: 42 µs, total: 10.9 ms
Wall time: 10.9 ms
So for subsequent calls we get a roughly 25x speedup over the simple version. I suggest you read High Performance Python if you want to know more.
Solution 3:
Keeping things simple, using one array, one loop, and one conditional.
a = [0,0,1,1,1,0,0,1,0,1,1]
for i in range(1, len(a)):
    if a[i] == 1:
        a[i] += a[i - 1]
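Run on the example input from Solution 1, this in-place loop produces the same counts:

```python
a = [0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1]
for i in range(1, len(a)):
    if a[i] == 1:        # only extend runs of ones; zeros stay zero
        a[i] += a[i - 1]
print(a)  # → [0, 0, 1, 2, 3, 0, 0, 1, 0, 1, 2]
```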