How to count duplicate rows in a pandas DataFrame?
I am trying to count the duplicates of each type of row in my dataframe. For example, say that I have a dataframe in pandas as follows:
import pandas as pd

df = pd.DataFrame({'one': pd.Series([1., 1, 1]),
                   'two': pd.Series([1., 2., 1])})
I get a df that looks like this:
   one  two
0    1    1
1    1    2
2    1    1
I imagine the first step is to find all the different unique rows, which I do by:
df.drop_duplicates()
This gives me the following df:
   one  two
0    1    1
1    1    2
Now I want to take each row from the above df ([1 1] and [1 2]) and get a count of how many times each occurs in the initial df. My result would look something like this:
Row     Count
[1 1]   2
[1 2]   1
How should I go about doing this last step?
Edit:
Here's a larger example to make it more clear:
df = pd.DataFrame({'one': pd.Series([True, True, True, False]),
                   'two': pd.Series([True, False, False, True]),
                   'three': pd.Series([True, False, False, False])})
gives me:
     one  three    two
0   True   True   True
1   True  False  False
2   True  False  False
3  False  False   True
I want a result that tells me:
Row                  Count
[True True True]     1
[True False False]   2
[False False True]   1
Solution 1:
You can groupby on all the columns and call size; the index indicates the duplicate values:
In [28]:
df.groupby(df.columns.tolist(),as_index=False).size()
Out[28]:
one    three  two
False  False  True     1
True   False  False    2
       True   True     1
dtype: int64
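On newer pandas (1.1 or later; an assumption about your version), DataFrame.value_counts does the same group-and-size in one call, returning the count of each unique row sorted by frequency. A minimal sketch on the boolean example from the question:
import pandas as pd

df = pd.DataFrame({'one': pd.Series([True, True, True, False]),
                   'two': pd.Series([True, False, False, True]),
                   'three': pd.Series([True, False, False, False])})

# requires pandas >= 1.1; counts each distinct row, most frequent first
df.value_counts()
The result is a Series keyed by the unique row values, much like the groupby output above.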
Solution 2:
df.groupby(df.columns.tolist()).size().reset_index().\
    rename(columns={0:'records'})
   one  two  records
0    1    1        2
1    1    2        1
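A small variant of the same idea (a sketch): Series.reset_index accepts a name argument, so the rename step can be dropped:
df.groupby(df.columns.tolist()).size().reset_index(name='records')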
Solution 3:
Specific to your question, as the others mentioned, a fast and easy way would be:
df.groupby(df.columns.tolist(),as_index=False).size()
If you want to count duplicates in particular column(s):
len(df['one']) - len(df['one'].drop_duplicates())
If you want to count duplicates over the entire DataFrame:
len(df) - len(df.drop_duplicates())
Or you can simply use DataFrame.duplicated(subset=None, keep='first'):
df.duplicated(subset='one', keep='first').sum()
where
subset : column label or sequence of labels (by default, use all of the columns)
keep : {‘first’, ‘last’, False}, default ‘first’
- first : Mark duplicates as True except for the first occurrence.
- last : Mark duplicates as True except for the last occurrence.
- False : Mark all duplicates as True.
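To illustrate the keep parameter on the boolean example df from the question (a short sketch): keep=False marks every member of a duplicate group, which is handy when you want to see the offending rows rather than just count them:
# rows 1 and 2 are identical, so keep=False flags both of them
df[df.duplicated(keep=False)]

# with the default keep='first', only the second copy is flagged
df.duplicated().sum()   # -> 1 for the boolean example df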
Solution 4:
I use:
used_features = [
    "one",
    "two",
    "three",
]
df['is_duplicated'] = df.duplicated(used_features)
df['is_duplicated'].sum()
which gives the count of duplicated rows, and you can then analyse them via the new column. I didn't see such a solution here.
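As a short usage sketch (continuing with the same is_duplicated column), the flagged rows can then be pulled out for inspection:
# select only the rows flagged as duplicates for further analysis
duplicate_rows = df[df['is_duplicated']]
print(duplicate_rows)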
Solution 5:
None of the existing answers quite offers a simple solution that returns "the number of rows that are just duplicates and should be cut out". Here is a one-size-fits-all solution that does:
# generate a table of those culprit rows which are duplicated:
dups = df.groupby(df.columns.tolist()).size().reset_index().rename(columns={0:'count'})
# sum the final col of that table, and subtract the number of culprits:
dups['count'].sum() - dups.shape[0]
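Note (my own sanity check, assuming there are no NaN values in the key columns, since groupby drops NaN group keys): duplicated() gives the same number directly:
# every True here is a repeat of an earlier row, i.e. a row that "should be cut out"
df.duplicated().sum()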