Pandas: Drop consecutive duplicates

Solution 1:

Use shift:

a.loc[a.shift(-1) != a]

Out[3]:

1    1
3    2
4    3
5    2
dtype: int64

So the above uses boolean critieria, we compare the dataframe against the dataframe shifted by -1 rows to create the mask

Another method is to use diff:

In [82]:

a.loc[a.diff() != 0]
Out[82]:
1    1
2    2
4    3
5    2
dtype: int64

But this is slower than the original method if you have a large number of rows.

Update

Thanks to Bjarke Ebert for pointing out a subtle error, I should actually use shift(1) or just shift() as the default is a period of 1, this returns the first consecutive value:

In [87]:

a.loc[a.shift() != a]
Out[87]:
1    1
2    2
4    3
5    2
dtype: int64

Note the difference in index values, thanks @BjarkeEbert!

Solution 2:

Here is an update that will make it work with multiple columns. Use ".any(axis=1)" to combine the results from each column:

cols = ["col1","col2","col3"]
de_dup = a[cols].loc[(a[cols].shift() != a[cols]).any(axis=1)]

Solution 3:

Since we are going for most efficient way, i.e. performance, let's use array data to leverage NumPy. We will slice one-off slices and compare, similar to shifting method discussed earlier in @EdChum's post. But with NumPy slicing we would end up with one-less array, so we need to concatenate with a True element at the start to select the first element and hence we would have an implementation like so -

def drop_consecutive_duplicates(a):
    ar = a.values
    return a[np.concatenate(([True],ar[:-1]!= ar[1:]))]

Sample run -

In [149]: a
Out[149]: 
1    1
2    2
3    2
4    3
5    2
dtype: int64

In [150]: drop_consecutive_duplicates(a)
Out[150]: 
1    1
2    2
4    3
5    2
dtype: int64

Timings on large arrays comparing @EdChum's solution -

In [142]: a = pd.Series(np.random.randint(1,5,(1000000)))

In [143]: %timeit a.loc[a.shift() != a]
100 loops, best of 3: 12.1 ms per loop

In [144]: %timeit drop_consecutive_duplicates(a)
100 loops, best of 3: 11 ms per loop

In [145]: a = pd.Series(np.random.randint(1,5,(10000000)))

In [146]: %timeit a.loc[a.shift() != a]
10 loops, best of 3: 136 ms per loop

In [147]: %timeit drop_consecutive_duplicates(a)
10 loops, best of 3: 114 ms per loop

So, there's some improvement!

Get major boost for values only!

If only the values are needed, we could get major boost by simply indexing into the array data, like so -

def drop_consecutive_duplicates(a):
    ar = a.values
    return ar[np.concatenate(([True],ar[:-1]!= ar[1:]))]

Sample run -

In [170]: a = pandas.Series([1,2,2,3,2], index=[1,2,3,4,5])

In [171]: drop_consecutive_duplicates(a)
Out[171]: array([1, 2, 3, 2])

Timings -

In [173]: a = pd.Series(np.random.randint(1,5,(10000000)))

In [174]: %timeit a.loc[a.shift() != a]
10 loops, best of 3: 137 ms per loop

In [175]: %timeit drop_consecutive_duplicates(a)
10 loops, best of 3: 61.3 ms per loop

Pandas: Drop consecutive duplicates

Solution 1:

Solution 2:

Solution 3:

Related

Recent Posts