Pandas - dataframe groupby - how to get sum of multiple columns
This should be an easy one, but somehow I couldn't find a solution that works.
I have a pandas dataframe which looks like this:
index col1 col2 col3 col4 col5
0 a c 1 2 f
1 a c 1 2 f
2 a d 1 2 f
3 b d 1 2 g
4 b e 1 2 g
5 b e 1 2 g
I want to group by col1 and col2 and get the sum() of col3 and col4. col5 can be dropped, since its data cannot be aggregated.
Here is how the output should look. I am interested in having both col3 and col4 in the resulting dataframe. It doesn't really matter whether col1 and col2 are part of the index or not.
index col1 col2 col3 col4
0 a c 2 4
1 a d 1 2
2 b d 1 2
3 b e 2 4
Here is what I tried:
df_new = df.groupby(['col1', 'col2'])["col3", "col4"].sum()
That, however, only returns the aggregated results of col4.
I am lost here. Every example I found only aggregates one column, where the issue obviously doesn't occur.
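By using apply (note the list of columns inside double brackets; indexing the groupby with a bare "col3", "col4" tuple is deprecated):
df.groupby(['col1', 'col2'])[["col3", "col4"]].apply(lambda x : x.astype(int).sum())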
Out[1257]:
           col3  col4
col1 col2
a    c        2     4
     d        1     2
b    d        1     2
     e        2     4
If you want to use agg:
df.groupby(['col1', 'col2']).agg({'col3':'sum','col4':'sum'})
Another generic solution is
df.groupby(['col1','col2']).agg({'col3':'sum','col4':'sum'}).reset_index()
This will give you the required output.
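As a minimal sketch of the same idea, assuming the sample frame below matches the table in the question, passing as_index=False to groupby keeps col1 and col2 as ordinary columns and so has the same effect as the trailing reset_index(). The names spec, with_reset and flat are only illustrative.

```python
import pandas as pd

# Sample frame rebuilt from the table in the question (an assumption).
df = pd.DataFrame({
    "col1": ["a", "a", "a", "b", "b", "b"],
    "col2": ["c", "c", "d", "d", "e", "e"],
    "col3": [1, 1, 1, 1, 1, 1],
    "col4": [2, 2, 2, 2, 2, 2],
    "col5": ["f", "f", "f", "g", "g", "g"],
})

# One aggregation function per column; col5 is simply left out.
spec = {"col3": "sum", "col4": "sum"}

# Variant from the answer: group keys end up in the index, then get reset.
with_reset = df.groupby(["col1", "col2"]).agg(spec).reset_index()

# Alternative: ask groupby not to move the keys into the index at all.
flat = df.groupby(["col1", "col2"], as_index=False).agg(spec)
```

Both frames should contain the columns col1, col2, col3 and col4 with a plain integer index.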
UPDATED (June 2020): Pandas 0.25.0 introduced "named aggregation" for groupby: keyword arguments with (column, function) tuples that name the output columns when applying aggregation functions to specific columns.
df.groupby(['col1','col2']).agg(
    sum_col3 = ('col3','sum'),
    sum_col4 = ('col4','sum'),
).reset_index()
The output column names are up to you; I've used 'sum_col3' and 'sum_col4', but any valid name works.
Refer to the pandas documentation on named aggregation for a detailed description.
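A self-contained sketch of named aggregation, assuming pandas 0.25 or later and a sample frame rebuilt from the question's table. The extra mean_col4 column is my own addition, included only to show that different functions can be mixed in one call.

```python
import pandas as pd

# Sample frame rebuilt from the table in the question (an assumption).
df = pd.DataFrame({
    "col1": ["a", "a", "a", "b", "b", "b"],
    "col2": ["c", "c", "d", "d", "e", "e"],
    "col3": [1, 1, 1, 1, 1, 1],
    "col4": [2, 2, 2, 2, 2, 2],
    "col5": ["f", "f", "f", "g", "g", "g"],
})

# Each keyword is an output column name; each value is (input column, function).
named = (
    df.groupby(["col1", "col2"])
      .agg(
          sum_col3=("col3", "sum"),
          sum_col4=("col4", "sum"),
          mean_col4=("col4", "mean"),  # illustrative extra column
      )
      .reset_index()
)
```

The result should have the columns col1, col2, sum_col3, sum_col4 and mean_col4, in that order.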
The above answer didn't work for me.
df_new = df.groupby(['col1', 'col2']).sum()[["col3", "col4"]]
I was grouping by a single column and summing a single column.
Here is the one that worked for me:
D1.groupby(['col1'])['col2'].sum()  # the sum goes at the end, not in the middle
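One detail worth knowing with the single-column form is the return type. The sketch below uses a small made-up D1 (the real frame from this answer is not shown, so the data is an assumption): single brackets give a Series, while a one-element list in double brackets gives a DataFrame.

```python
import pandas as pd

# Hypothetical stand-in for D1; the original data is not shown in the answer.
D1 = pd.DataFrame({
    "col1": ["a", "a", "b"],
    "col2": [1, 2, 3],
})

# Single brackets select one column, so the result is a Series indexed by col1.
as_series = D1.groupby(["col1"])["col2"].sum()

# Double brackets select a list of columns, so the result is a DataFrame.
as_frame = D1.groupby(["col1"])[["col2"]].sum()
```

Both hold the same sums (3 for "a" and 3 for "b"); only the container type differs.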
Because of the pandas FutureWarning "Indexing with multiple keys" (discussed on GitHub and Stack Overflow), I recommend this solution, which passes the columns as a list:
df.groupby(['col1', 'col2'])[['col3', 'col4']].sum().reset_index()
Output:
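For the sample data in the question, the result can be reproduced with the sketch below. The frame construction is an assumption based on the question's table, and the printed layout may differ slightly between pandas versions.

```python
import pandas as pd

# Sample frame rebuilt from the table in the question (an assumption).
df = pd.DataFrame({
    "col1": ["a", "a", "a", "b", "b", "b"],
    "col2": ["c", "c", "d", "d", "e", "e"],
    "col3": [1, 1, 1, 1, 1, 1],
    "col4": [2, 2, 2, 2, 2, 2],
    "col5": ["f", "f", "f", "g", "g", "g"],
})

# Select the value columns with a list (double brackets), then sum per group.
result = df.groupby(["col1", "col2"])[["col3", "col4"]].sum().reset_index()
print(result)
#   col1 col2  col3  col4
# 0    a    c     2     4
# 1    a    d     1     2
# 2    b    d     1     2
# 3    b    e     2     4
```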