Finding count of distinct elements in DataFrame in each column
I am trying to find the count of distinct values in each column using Pandas. This is what I did.
import pandas as pd
import numpy as np
# Generate data.
NROW = 10000
NCOL = 100
df = pd.DataFrame(np.random.randint(1, 100000, (NROW, NCOL)),
columns=['col' + x for x in np.arange(NCOL).astype(str)])
I need to count the number of distinct elements for each column, like this:
col0 9538
col1 9505
col2 9524
What would be the most efficient way to do this, as this method will be applied to files which have size greater than 1.5GB?
Based upon the answers, df.apply(lambda x: len(x.unique()))
is the fastest (notebook).
%timeit df.apply(lambda x: len(x.unique()))
10 loops, best of 3: 49.5 ms per loop
%timeit df.nunique()
10 loops, best of 3: 59.7 ms per loop
%timeit df.apply(pd.Series.nunique)
10 loops, best of 3: 60.3 ms per loop
%timeit df.T.apply(lambda x: x.nunique(), axis=1)
10 loops, best of 3: 60.5 ms per loop
As of pandas 0.20 we can use nunique
directly on DataFrame
s, i.e.:
df.nunique()
a 4
b 5
c 1
dtype: int64
Other legacy options:
You could do a transpose of the df and then using apply
call nunique
row-wise:
In [205]:
df = pd.DataFrame({'a':[0,1,1,2,3],'b':[1,2,3,4,5],'c':[1,1,1,1,1]})
df
Out[205]:
a b c
0 0 1 1
1 1 2 1
2 1 3 1
3 2 4 1
4 3 5 1
In [206]:
df.T.apply(lambda x: x.nunique(), axis=1)
Out[206]:
a 4
b 5
c 1
dtype: int64
EDIT
As pointed out by @ajcr the transpose is unnecessary:
In [208]:
df.apply(pd.Series.nunique)
Out[208]:
a 4
b 5
c 1
dtype: int64
A Pandas.Series
has a .value_counts()
function that provides exactly what you want to. Check out the documentation for the function.
Already some great answers here :) but this one seems to be missing:
df.apply(lambda x: x.nunique())
As of pandas 0.20.0, DataFrame.nunique()
is also available.
Recently, I have same issues of counting unique value of each column in DataFrame, and I found some other function that runs faster than the apply
function:
#Select the way how you want to store the output, could be pd.DataFrame or Dict, I will use Dict to demonstrate:
col_uni_val={}
for i in df.columns:
col_uni_val[i] = len(df[i].unique())
#Import pprint to display dic nicely:
import pprint
pprint.pprint(col_uni_val)
This works for me almost twice faster than df.apply(lambda x: len(x.unique()))