Why is np.where faster than pd.apply?
Sample code is here:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Customer': ['Bob', 'Ken', 'Steve', 'Joe'],
                   'Spending': [130, 22, 313, 46]})

# 400000 rows x 2 columns (4 columns once the grade columns below are added)
df = pd.concat([df] * 100000).reset_index(drop=True)
In [129]: %timeit df['Grade']= np.where(df['Spending'] > 100 ,'A','B')
10 loops, best of 3: 21.6 ms per loop
In [130]: %timeit df['grade'] = df.apply(lambda row: 'A' if row['Spending'] > 100 else 'B', axis = 1)
1 loop, best of 3: 7.08 s per loop
Question taken from here: https://stackoverflow.com/a/41166160/3027854
I think np.where is faster because it works on NumPy arrays in a vectorized way, and pandas is built on top of those arrays.
df.apply is slow because it loops over the rows and calls a Python function for each one.
Vectorized operations are the fastest, then Cython routines, and then apply.
See this answer for a better explanation from Jeff, a pandas developer.
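As a rough sketch of the difference (the loop below only illustrates the per-row Python overhead, it is not pandas' actual internals):

import numpy as np
import pandas as pd

df = pd.DataFrame({'Spending': [130, 22, 313, 46] * 100000})

# Vectorized path: one comparison over the whole underlying NumPy array,
# evaluated in compiled code.
grade_vec = np.where(df['Spending'].values > 100, 'A', 'B')

# Roughly what row-wise apply has to pay for: a Python-level evaluation
# once per row (illustrative only).
grade_loop = ['A' if s > 100 else 'B' for s in df['Spending']]

assert list(grade_vec) == grade_loop

Both produce the same grades; the vectorized version just avoids 400000 Python-level calls.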
Just adding a visualization approach to what has been said.
Profile and total cumulative time of df.apply:
We can see that the cumulative time is 13.8 s.
Profile and total cumulative time of np.where:
Here, the cumulative time is 5.44 ms, which is about 2500 times faster than df.apply.
The figures above were obtained using the snakeviz library.
Here is a link to the library.
SnakeViz displays profiles as a sunburst in which functions are represented as arcs. A root function is a circle at the middle, with functions it calls around, then the functions those functions call, and so on. The amount of time spent inside a function is represented by the angular width of the arc. An arc that wraps most of the way around the circle represents a function that is taking up most of the time of its calling function, while a skinny arc represents a function that is using hardly any time at all.
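One way to reproduce this kind of profile (an assumed setup, since the answer does not show its exact commands; the file name apply.prof is just an example) is to dump a cProfile result to a file and open it with snakeviz:

import cProfile
import pandas as pd

df = pd.DataFrame({'Customer': ['Bob', 'Ken', 'Steve', 'Joe'],
                   'Spending': [130, 22, 313, 46]})
df = pd.concat([df] * 100000).reset_index(drop=True)

# Profile the slow row-wise version and write the stats to apply.prof
cProfile.runctx(
    "df.apply(lambda row: 'A' if row['Spending'] > 100 else 'B', axis=1)",
    globals(), locals(), 'apply.prof')

# Then, from a shell:  snakeviz apply.prof
# In IPython/Jupyter, snakeviz also provides %snakeviz / %%snakeviz magics
# (after %load_ext snakeviz).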