What is the difference between pandas agg and apply function?
I can't figure out the difference between Pandas .aggregate
and .apply
functions.
Take the following as an example: I load a dataset, do a groupby
, define a simple function,
and either user .agg
or .apply
.
As you may see, the printing statement within my function results in the same output
after using .agg
and .apply
. The result, on the other hand is different. Why is that?
import pandas
import pandas as pd
iris = pd.read_csv('iris.csv')
by_species = iris.groupby('Species')
def f(x):
...: print type(x)
...: print x.head(3)
...: return 1
Using apply
:
by_species.apply(f)
#<class 'pandas.core.frame.DataFrame'>
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#0 5.1 3.5 1.4 0.2 setosa
#1 4.9 3.0 1.4 0.2 setosa
#2 4.7 3.2 1.3 0.2 setosa
#<class 'pandas.core.frame.DataFrame'>
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#0 5.1 3.5 1.4 0.2 setosa
#1 4.9 3.0 1.4 0.2 setosa
#2 4.7 3.2 1.3 0.2 setosa
#<class 'pandas.core.frame.DataFrame'>
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#50 7.0 3.2 4.7 1.4 versicolor
#51 6.4 3.2 4.5 1.5 versicolor
#52 6.9 3.1 4.9 1.5 versicolor
#<class 'pandas.core.frame.DataFrame'>
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#100 6.3 3.3 6.0 2.5 virginica
#101 5.8 2.7 5.1 1.9 virginica
#102 7.1 3.0 5.9 2.1 virginica
#Out[33]:
#Species
#setosa 1
#versicolor 1
#virginica 1
#dtype: int64
Using agg
by_species.agg(f)
#<class 'pandas.core.frame.DataFrame'>
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#0 5.1 3.5 1.4 0.2 setosa
#1 4.9 3.0 1.4 0.2 setosa
#2 4.7 3.2 1.3 0.2 setosa
#<class 'pandas.core.frame.DataFrame'>
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#50 7.0 3.2 4.7 1.4 versicolor
#51 6.4 3.2 4.5 1.5 versicolor
#52 6.9 3.1 4.9 1.5 versicolor
#<class 'pandas.core.frame.DataFrame'>
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#100 6.3 3.3 6.0 2.5 virginica
#101 5.8 2.7 5.1 1.9 virginica
#102 7.1 3.0 5.9 2.1 virginica
#Out[34]:
# Sepal.Length Sepal.Width Petal.Length Petal.Width
#Species
#setosa 1 1 1 1
#versicolor 1 1 1 1
#virginica 1 1 1 1
apply
applies the function to each group (your Species
). Your function returns 1, so you end up with 1 value for each of 3 groups.
agg
aggregates each column (feature) for each group, so you end up with one value per column per group.
Do read the groupby
docs, they're quite helpful. There are also a bunch of tutorials floating around the web.
(Note: These comparisons are relevant for DataframeGroupby objects)
Some plausible advantages of using .agg()
compared to .apply()
, for DataFrame GroupBy objects would be:
.agg()
gives the flexibility of applying multiple functions at once, or pass a list of function to each column.Also, applying different functions at once to different columns of dataframe.
That means you have pretty much control over each column with each operation.
Here is the link for more details: http://pandas.pydata.org/pandas-docs/version/0.13.1/groupby.html
However, the apply
function could be limited to apply one function to each column of the dataframe at a time. So, you might have to call the apply function repeatedly to call upon different operations to the same column.
Here are some example comparisons for .apply()
vs .agg()
for DataframeGroupBy objects :
Given the following dataframe:
In [261]: df = pd.DataFrame({"name":["Foo", "Baar", "Foo", "Baar"], "score_1":[5,10,15,10], "score_2" :[10,15,10,25], "score_3" : [10,20,30,40]})
In [262]: df
Out[262]:
name score_1 score_2 score_3
0 Foo 5 10 10
1 Baar 10 15 20
2 Foo 15 10 30
3 Baar 10 25 40
Lets first see the operations using .apply()
:
In [263]: df.groupby(["name", "score_1"])["score_2"].apply(lambda x : x.sum())
Out[263]:
name score_1
Baar 10 40
Foo 5 10
15 10
Name: score_2, dtype: int64
In [264]: df.groupby(["name", "score_1"])["score_2"].apply(lambda x : x.min())
Out[264]:
name score_1
Baar 10 15
Foo 5 10
15 10
Name: score_2, dtype: int64
In [265]: df.groupby(["name", "score_1"])["score_2"].apply(lambda x : x.mean())
Out[265]:
name score_1
Baar 10 20.0
Foo 5 10.0
15 10.0
Name: score_2, dtype: float64
Now, look at the same operations using .agg( ) effortlessly:
In [276]: df.groupby(["name", "score_1"]).agg({"score_3" :[np.sum, np.min, np.mean, np.max], "score_2":lambda x : x.mean()})
Out[276]:
score_2 score_3
<lambda> sum amin mean amax
name score_1
Baar 10 20 60 20 30 40
Foo 5 10 10 10 10 10
15 10 30 30 30 30
So, .agg()
could be really handy at handling the DataFrameGroupBy objects, as compared to .apply()
. But, if you are handling only pure dataframe objects and not DataFrameGroupBy objects, then apply()
can be very useful, as apply()
can apply a function along any axis of the dataframe.
(For Eg: axis = 0
implies column-wise operation with .apply(),
which is a default mode, and axis = 1
would imply for row-wise operation while dealing with pure dataframe objects).