Pandas, groupby and count
I have a dataframe like this:
>>> df = pd.DataFrame({'user_id': ['a', 'a', 's', 's', 's'],
...                    'session': [4, 5, 4, 5, 5],
...                    'revenue': [-1, 0, 1, 2, 1]})
>>> df
   revenue  session user_id
0       -1        4       a
1        0        5       a
2        1        4       s
3        2        5       s
4        1        5       s
Each combination of session and revenue values represents a kind of type, and I want to count the number of rows of each kind. For example, the count for revenue=-1, session=4, user_id='a' is 1.
I found that simply calling count() after groupby() doesn't output the result I want:
>>> df.groupby('user_id').count()
         revenue  session
user_id
a              2        2
s              3        3
How can I do that?
Solution 1:
You seem to want to group by several columns at once:
df.groupby(['revenue','session','user_id'])['user_id'].count()
should give you what you want.
Solution 2:
pandas >= 1.1: df.value_counts is available!
From pandas 1.1, this will be my recommended method for counting the number of rows in groups (i.e., the group size). To count the number of non-nan rows in a group for a specific column, check out the accepted answer.
Old
df.groupby(['A', 'B']).size() # df.groupby(['A', 'B'])['C'].count()
New [✓]
df.value_counts(subset=['A', 'B'])
Note that size and count are not identical: the former counts all rows per group, while the latter counts only non-null rows. See this other answer of mine for more.
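A small sketch of the difference, using a hypothetical dataframe with one missing value in column 'C':

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['x', 'x', 'y'],
                   'C': [1.0, np.nan, 3.0]})

sizes = df.groupby('A').size()         # counts every row per group
counts = df.groupby('A')['C'].count()  # counts only non-null values in 'C'
```

Here group 'x' has size 2 but a count of 1, because one of its 'C' values is NaN.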
Minimal Example
pd.__version__
# '1.1.0.dev0+2004.g8d10bfb6f'
df = pd.DataFrame({'num_legs': [2, 4, 4, 6],
                   'num_wings': [2, 0, 0, 0]},
                  index=['falcon', 'dog', 'cat', 'ant'])
df
        num_legs  num_wings
falcon         2          2
dog            4          0
cat            4          0
ant            6          0
df.value_counts(subset=['num_legs', 'num_wings'], sort=False)
num_legs  num_wings
2         2            1
4         0            2
6         0            1
dtype: int64
Compare this output with
df.groupby(['num_legs', 'num_wings'])['num_legs'].size()
num_legs  num_wings
2         2            1
4         0            2
6         0            1
Name: num_legs, dtype: int64
Performance
It's also faster if you don't sort the result:
%timeit df.groupby(['num_legs', 'num_wings'])['num_legs'].count()
%timeit df.value_counts(subset=['num_legs', 'num_wings'], sort=False)
640 µs ± 28.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
568 µs ± 6.88 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Solution 3:
I struggled with the same issue and made use of the solution provided above. You can actually designate any of the columns to count:
df.groupby(['revenue','session','user_id'])['revenue'].count()
and
df.groupby(['revenue','session','user_id'])['session'].count()
would give the same answer.