Why is np.where faster than pd.apply?
Sample code is here:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Customer': ['Bob', 'Ken', 'Steve', 'Joe'],
                   'Spending': [130, 22, 313, 46]})

# 400000 rows x 2 columns (4 columns once the grade columns below are added)
df = pd.concat([df] * 100000).reset_index(drop=True)
In [129]: %timeit df['Grade']= np.where(df['Spending'] > 100 ,'A','B')
10 loops, best of 3: 21.6 ms per loop
In [130]: %timeit df['grade'] = df.apply(lambda row: 'A' if row['Spending'] > 100 else 'B', axis = 1)
1 loop, best of 3: 7.08 s per loop
Question taken from here: https://stackoverflow.com/a/41166160/3027854
I think np.where is faster because it works on NumPy arrays in a vectorized way, and pandas is built on top of those arrays.
df.apply is slow because it loops over the rows and calls a Python function for each one.
Vectorized operations are the fastest, then Cython routines, and then apply.
See this answer for a better explanation from Jeff, a pandas developer.
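As a rough sketch of the difference (the loop below only illustrates the per-row Python overhead, it is not pandas' actual internals):

import numpy as np
import pandas as pd

df = pd.DataFrame({'Spending': [130, 22, 313, 46] * 100000})

# Vectorized path: one comparison over the whole underlying NumPy array,
# evaluated in compiled code.
grade_vec = np.where(df['Spending'].values > 100, 'A', 'B')

# Roughly what row-wise apply has to pay for: a Python-level evaluation
# once per row (illustrative only).
grade_loop = ['A' if s > 100 else 'B' for s in df['Spending']]

assert list(grade_vec) == grade_loop

Both produce the same grades; the vectorized version just avoids 400000 Python-level calls.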
Just adding a visualization approach to what has been said.
Profile and total cumulative time of df.apply:
We can see that the cumulative time is 13.8 s.
Profile and total cumulative time of np.where:
Here, the cumulative time is 5.44 ms, which is about 2500 times faster than df.apply.
The figures above were obtained using the snakeviz library.
Here is a link to the library.
SnakeViz displays profiles as a sunburst in which functions are represented as arcs. A root function is a circle at the middle, with functions it calls around, then the functions those functions call, and so on. The amount of time spent inside a function is represented by the angular width of the arc. An arc that wraps most of the way around the circle represents a function that is taking up most of the time of its calling function, while a skinny arc represents a function that is using hardly any time at all.
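One way to reproduce this kind of profile (an assumed setup, since the answer does not show its exact commands; the file name apply.prof is just an example) is to dump a cProfile result to a file and open it with snakeviz:

import cProfile
import pandas as pd

df = pd.DataFrame({'Customer': ['Bob', 'Ken', 'Steve', 'Joe'],
                   'Spending': [130, 22, 313, 46]})
df = pd.concat([df] * 100000).reset_index(drop=True)

# Profile the slow row-wise version and write the stats to apply.prof
cProfile.runctx(
    "df.apply(lambda row: 'A' if row['Spending'] > 100 else 'B', axis=1)",
    globals(), locals(), 'apply.prof')

# Then, from a shell:  snakeviz apply.prof
# In IPython/Jupyter, snakeviz also provides %snakeviz / %%snakeviz magics
# (after %load_ext snakeviz).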