unique combinations of values in selected columns in pandas data frame and count
You can groupby
on cols 'A' and 'B' and call size
and then reset_index
and rename
the generated column:
In [26]:
df1.groupby(['A','B']).size().reset_index().rename(columns={0:'count'})
Out[26]:
A B count
0 no no 1
1 no yes 2
2 yes no 4
3 yes yes 3
update
A little explanation, by grouping on the 2 columns, this groups rows where A and B values are the same, we call size
which returns the number of unique groups:
In[202]:
df1.groupby(['A','B']).size()
Out[202]:
A B
no no 1
yes 2
yes no 4
yes 3
dtype: int64
So now to restore the grouped columns, we call reset_index
:
In[203]:
df1.groupby(['A','B']).size().reset_index()
Out[203]:
A B 0
0 no no 1
1 no yes 2
2 yes no 4
3 yes yes 3
This restores the indices but the size aggregation is turned into a generated column 0
, so we have to rename this:
In[204]:
df1.groupby(['A','B']).size().reset_index().rename(columns={0:'count'})
Out[204]:
A B count
0 no no 1
1 no yes 2
2 yes no 4
3 yes yes 3
groupby
does accept the arg as_index
which we could have set to False
so it doesn't make the grouped columns the index, but this generates a series
and you'd still have to restore the indices and so on....:
In[205]:
df1.groupby(['A','B'], as_index=False).size()
Out[205]:
A B
no no 1
yes 2
yes no 4
yes 3
dtype: int64
In Pandas 1.1.0 you can use the method value_counts
with DataFrames:
df.value_counts() # or df[['A', 'B']].value_counts()
Result:
A B
yes no 4
yes 3
no yes 2
no 1
dtype: int64
Convert index to columns and sort by value counts:
df.value_counts(ascending=True).reset_index(name='count')
Result:
A B count
0 no no 1
1 no yes 2
2 yes yes 3
3 yes no 4
Slightly related, I was looking for the unique combinations and I came up with this method:
def unique_columns(df,columns):
result = pd.Series(index = df.index)
groups = meta_data_csv.groupby(by = columns)
for name,group in groups:
is_unique = len(group) == 1
result.loc[group.index] = is_unique
assert not result.isnull().any()
return result
And if you only want to assert that all combinations are unique:
df1.set_index(['A','B']).index.is_unique