Pandas: Drop consecutive duplicates
Solution 1:
Use shift
:
a.loc[a.shift(-1) != a]
Out[3]:
1 1
3 2
4 3
5 2
dtype: int64
So the above uses boolean critieria, we compare the dataframe against the dataframe shifted by -1 rows to create the mask
Another method is to use diff
:
In [82]:
a.loc[a.diff() != 0]
Out[82]:
1 1
2 2
4 3
5 2
dtype: int64
But this is slower than the original method if you have a large number of rows.
Update
Thanks to Bjarke Ebert for pointing out a subtle error, I should actually use shift(1)
or just shift()
as the default is a period of 1, this returns the first consecutive value:
In [87]:
a.loc[a.shift() != a]
Out[87]:
1 1
2 2
4 3
5 2
dtype: int64
Note the difference in index values, thanks @BjarkeEbert!
Solution 2:
Here is an update that will make it work with multiple columns. Use ".any(axis=1)" to combine the results from each column:
cols = ["col1","col2","col3"]
de_dup = a[cols].loc[(a[cols].shift() != a[cols]).any(axis=1)]
Solution 3:
Since we are going for most efficient way
, i.e. performance, let's use array data to leverage NumPy. We will slice one-off slices and compare, similar to shifting method discussed earlier in @EdChum's post
. But with NumPy slicing we would end up with one-less array, so we need to concatenate with a True
element at the start to select the first element and hence we would have an implementation like so -
def drop_consecutive_duplicates(a):
ar = a.values
return a[np.concatenate(([True],ar[:-1]!= ar[1:]))]
Sample run -
In [149]: a
Out[149]:
1 1
2 2
3 2
4 3
5 2
dtype: int64
In [150]: drop_consecutive_duplicates(a)
Out[150]:
1 1
2 2
4 3
5 2
dtype: int64
Timings on large arrays comparing @EdChum's solution
-
In [142]: a = pd.Series(np.random.randint(1,5,(1000000)))
In [143]: %timeit a.loc[a.shift() != a]
100 loops, best of 3: 12.1 ms per loop
In [144]: %timeit drop_consecutive_duplicates(a)
100 loops, best of 3: 11 ms per loop
In [145]: a = pd.Series(np.random.randint(1,5,(10000000)))
In [146]: %timeit a.loc[a.shift() != a]
10 loops, best of 3: 136 ms per loop
In [147]: %timeit drop_consecutive_duplicates(a)
10 loops, best of 3: 114 ms per loop
So, there's some improvement!
Get major boost for values only!
If only the values are needed, we could get major boost by simply indexing into the array data, like so -
def drop_consecutive_duplicates(a):
ar = a.values
return ar[np.concatenate(([True],ar[:-1]!= ar[1:]))]
Sample run -
In [170]: a = pandas.Series([1,2,2,3,2], index=[1,2,3,4,5])
In [171]: drop_consecutive_duplicates(a)
Out[171]: array([1, 2, 3, 2])
Timings -
In [173]: a = pd.Series(np.random.randint(1,5,(10000000)))
In [174]: %timeit a.loc[a.shift() != a]
10 loops, best of 3: 137 ms per loop
In [175]: %timeit drop_consecutive_duplicates(a)
10 loops, best of 3: 61.3 ms per loop