Pandas 'count(distinct)' equivalent
Solution 1:
I believe this is what you want:
table.groupby('YEARMONTH').CLIENTCODE.nunique()
Example:
In [2]: table
Out[2]:
CLIENTCODE YEARMONTH
0 1 201301
1 1 201301
2 2 201301
3 1 201302
4 2 201302
5 2 201302
6 3 201302
In [3]: table.groupby('YEARMONTH').CLIENTCODE.nunique()
Out[3]:
YEARMONTH
201301 2
201302 3
Solution 2:
Here is another method and it is much simpler. Let’s say your dataframe name is daat
and the column name is YEARMONTH
:
daat.YEARMONTH.value_counts()
Solution 3:
Interestingly enough, very often len(unique())
is a few times (3x-15x) faster than nunique()
.
Solution 4:
I am also using nunique
but it will be very helpful if you have to use an aggregate function like 'min', 'max', 'count' or 'mean'
etc.
df.groupby('YEARMONTH')['CLIENTCODE'].transform('nunique') #count(distinct)
df.groupby('YEARMONTH')['CLIENTCODE'].transform('min') #min
df.groupby('YEARMONTH')['CLIENTCODE'].transform('max') #max
df.groupby('YEARMONTH')['CLIENTCODE'].transform('mean') #average
df.groupby('YEARMONTH')['CLIENTCODE'].transform('count') #count
Solution 5:
Distinct of column along with aggregations on other columns
To get the distinct number of values for any column (CLIENTCODE
in your case), we can use nunique
. We can pass the input as a dictionary in agg
function, along with aggregations on other columns:
grp_df = df.groupby('YEARMONTH').agg({'CLIENTCODE': ['nunique'],
'other_col_1': ['sum', 'count']})
# to flatten the multi-level columns
grp_df.columns = ["_".join(col).strip() for col in grp_df.columns.values]
# if you wish to reset the index
grp_df.reset_index(inplace=True)