Performance of Pandas apply vs np.vectorize to create new column from existing columns
Solution 1:
I will start by saying that the power of Pandas and NumPy arrays is derived from high-performance vectorised calculations on numeric arrays [1]. The entire point of vectorised calculations is to avoid Python-level loops by moving calculations to highly optimised C code and utilising contiguous memory blocks [2].
Python-level loops
Now we can look at some timings. Below are all Python-level loops which produce either pd.Series, np.ndarray or list objects containing the same values. For the purposes of assignment to a series within a dataframe, the results are comparable.
# Python 3.6.5, NumPy 1.14.3, Pandas 0.23.0
np.random.seed(0)
N = 10**5
%timeit list(map(divide, df['A'], df['B'])) # 43.9 ms
%timeit np.vectorize(divide)(df['A'], df['B']) # 48.1 ms
%timeit [divide(a, b) for a, b in zip(df['A'], df['B'])] # 49.4 ms
%timeit [divide(a, b) for a, b in df[['A', 'B']].itertuples(index=False)] # 112 ms
%timeit df.apply(lambda row: divide(*row), axis=1, raw=True) # 760 ms
%timeit df.apply(lambda row: divide(row['A'], row['B']), axis=1) # 4.83 s
%timeit [divide(row['A'], row['B']) for _, row in df[['A', 'B']].iterrows()] # 11.6 s
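The timings above rely on a df and a scalar divide that this answer does not define. As a minimal sketch, assuming two integer columns A and B and a divide that returns 0 when the divisor is 0 (both are assumptions, inferred from the numba version further down), the following checks that the faster loop variants really do produce the same values:

```python
import numpy as np
import pandas as pd

# Assumed setup: the column contents and the scalar divide are illustrative,
# they are not taken from the answer itself.
np.random.seed(0)
N = 1000
df = pd.DataFrame({'A': np.random.randint(0, 100, N),
                   'B': np.random.randint(0, 100, N)})

def divide(a, b):
    # Guard against division by zero, mirroring the numba version.
    return a / b if b != 0 else 0.0

via_map = list(map(divide, df['A'], df['B']))
# otypes is given explicitly so the output dtype does not depend on the first call.
via_vectorize = np.vectorize(divide, otypes=[float])(df['A'], df['B'])
via_zip = [divide(a, b) for a, b in zip(df['A'], df['B'])]
via_itertuples = [divide(a, b) for a, b in df[['A', 'B']].itertuples(index=False)]
via_apply_raw = df.apply(lambda row: divide(*row), axis=1, raw=True)
```

Any of these can be assigned to a new column, e.g. df['C'] = via_map. Only the cost of producing the values differs.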
Some takeaways:
- The tuple-based methods (the first 4) are markedly more efficient than the pd.Series-based methods (the last 3).
- The np.vectorize, list comprehension + zip and map methods, i.e. the top 3, all have roughly the same performance. This is because they use tuple and bypass some Pandas overhead from pd.DataFrame.itertuples.
- There is a significant speed improvement from using raw=True with pd.DataFrame.apply versus without. This option feeds NumPy arrays to the custom function instead of pd.Series objects.
pd.DataFrame.apply: just another loop
To see exactly the objects Pandas passes around, you can amend your function trivially:
def foo(row):
    print(type(row))
    assert False  # because you only need to see this once
df.apply(lambda row: foo(row), axis=1)
Output: <class 'pandas.core.series.Series'>. Creating, passing and querying a Pandas series object carries significant overheads relative to NumPy arrays. This shouldn't be a surprise: Pandas series include a decent amount of scaffolding to hold an index, values, attributes, etc.
Do the same exercise again with raw=True and you'll see <class 'numpy.ndarray'>. All this is described in the docs, but seeing it is more convincing.
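A variation on the same check that records the types instead of asserting, so that both modes can be compared in one run. This is a sketch on made-up data; the make_recorder helper is hypothetical and only exists for this illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0], 'B': [4.0, 5.0]})  # illustrative data
seen = {}

def make_recorder(label):
    # Hypothetical helper: remembers the type of the first row it receives.
    def recorder(row):
        seen.setdefault(label, type(row))
        return 0.0
    return recorder

df.apply(make_recorder('default'), axis=1)        # rows arrive as pd.Series
df.apply(make_recorder('raw'), axis=1, raw=True)  # rows arrive as np.ndarray

print(seen)
```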
np.vectorize: fake vectorisation
The docs for np.vectorize have the following note:
The vectorized function evaluates pyfunc over successive tuples of the input arrays like the python map function, except it uses the broadcasting rules of numpy.
The "broadcasting rules" are irrelevant here, since the input arrays have the same dimensions. The parallel to map is instructive, since the map version above has almost identical performance. The source code shows what's happening: np.vectorize converts your input function into a Universal function ("ufunc") via np.frompyfunc. There is some optimisation, e.g. caching, which can lead to some performance improvement.
In short, np.vectorize does what a Python-level loop should do, but pd.DataFrame.apply adds a chunky overhead. There's no JIT-compilation which you see with numba (see below). It's just a convenience.
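To make the "just a loop" point concrete, here is a small sketch comparing np.vectorize, np.frompyfunc and a plain list comprehension on made-up arrays. The scalar divide is an assumed stand-in for the function being timed:

```python
import numpy as np

def divide(a, b):
    # Assumed scalar function: 0.0 when the divisor is 0.
    return a / b if b != 0 else 0.0

a = np.array([1, 2, 3, 4])
b = np.array([2, 0, 4, 8])

looped = np.array([divide(x, y) for x, y in zip(a, b)])

# otypes fixes the output dtype; otherwise np.vectorize infers it by
# calling the function on the first element.
vectorized = np.vectorize(divide, otypes=[float])(a, b)

# The lower-level building block: returns an object-dtype array.
as_ufunc = np.frompyfunc(divide, 2, 1)(a, b)

print(looped)      # [0.5  0.   0.75 0.5 ]
print(as_ufunc.dtype)
```

All three call the Python function once per element, so none of them escapes the interpreter overhead.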
True vectorisation: what you should use
Why aren't the above differences mentioned anywhere? Because the performance of truly vectorised calculations makes them irrelevant:
%timeit np.where(df['B'] == 0, 0, df['A'] / df['B']) # 1.17 ms
%timeit (df['A'] / df['B']).replace([np.inf, -np.inf], 0) # 1.96 ms
Yes, that's ~40x faster than the fastest of the above loopy solutions. Either of these is acceptable. In my opinion, the first is succinct, readable and efficient. Only look at other methods, e.g. numba below, if performance is critical and this is part of your bottleneck.
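One caveat worth checking on a tiny example: the two expressions agree whenever A is non-zero, but they can differ when both A and B are 0, because 0 / 0 evaluates to NaN rather than inf and replace only targets the infinities. A sketch on made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [6, 5, 0], 'B': [3, 0, 0]})  # illustrative data

# Pandas evaluates the full division first: 6/3 = 2.0, 5/0 = inf, 0/0 = NaN.
quotient = df['A'] / df['B']

via_where = np.where(df['B'] == 0, 0, quotient)        # ndarray
via_replace = quotient.replace([np.inf, -np.inf], 0)   # pd.Series

print(via_where)             # [2. 0. 0.]
print(via_replace.tolist())  # [2.0, 0.0, nan]
```

If rows with A == 0 and B == 0 can occur, the np.where form (or an additional fillna(0)) matches the behaviour of the loop-based divide.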
numba.njit: greater efficiency
When loops are considered viable, they are usually optimised via numba with underlying NumPy arrays, so that as much as possible runs as compiled code.
Indeed, numba improves performance to microseconds. Without some cumbersome work, it will be difficult to get much more efficient than this.
import numpy as np
from numba import njit

@njit
def divide(a, b):
    # Preallocate the float output, then fill it element by element.
    res = np.empty(a.shape)
    for i in range(len(a)):
        if b[i] != 0:
            res[i] = a[i] / b[i]
        else:
            res[i] = 0
    return res

%timeit divide(df['A'].values, df['B'].values) # 717 µs
Using @njit(parallel=True) may provide a further boost for larger arrays, provided the loop is written with numba.prange instead of range so that numba can parallelise it.
[1] Numeric types include: int, float, datetime, bool, category. They exclude object dtype and can be held in contiguous memory blocks.
[2] There are at least 2 reasons why NumPy operations are efficient versus Python:
- Everything in Python is an object. This includes, unlike C, numbers. Python types therefore have an overhead which does not exist with native C types.
- NumPy methods are usually C-based. In addition, optimised algorithms are used where possible.
Solution 2:
The more complex your functions get (i.e., the less numpy can move to its own internals), the more you will see that the performance won't be that different. For example:
name_series = pd.Series(np.random.choice(['adam', 'chang', 'eliza', 'odom'], replace=True, size=100000))
def parse_name(name):
    if name.lower().startswith('a'):
        return 'A'
    elif name.lower().startswith('e'):
        return 'E'
    elif name.lower().startswith('i'):
        return 'I'
    elif name.lower().startswith('o'):
        return 'O'
    elif name.lower().startswith('u'):
        return 'U'
    return name
parse_name_vec = np.vectorize(parse_name)
Doing some timings:
Using Apply
%timeit name_series.apply(parse_name)
Results:
76.2 ms ± 626 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using np.vectorize
%timeit parse_name_vec(name_series)
Results:
77.3 ms ± 216 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
NumPy tries to turn Python functions into NumPy ufunc objects when you call np.vectorize. How it does this, I don't actually know; you'd have to dig further into the internals of NumPy than I'm willing to at the moment. That said, it seems to do a better job on simple numerical functions than on this string-based function.
Cranking the size up to 1,000,000:
name_series = pd.Series(np.random.choice(['adam', 'chang', 'eliza', 'odom'], replace=True, size=1000000))
apply
%timeit name_series.apply(parse_name)
Results:
769 ms ± 5.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
np.vectorize
%timeit parse_name_vec(name_series)
Results:
794 ms ± 4.85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
A better (vectorized) way with np.select:
cases = [
    name_series.str.lower().str.startswith('a'),
    name_series.str.lower().str.startswith('e'),
    name_series.str.lower().str.startswith('i'),
    name_series.str.lower().str.startswith('o'),
    name_series.str.lower().str.startswith('u'),
]
replacements = 'A E I O U'.split()
Timings:
%timeit np.select(cases, replacements, default=name_series)
Results:
67.2 ms ± 683 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
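As a sanity check that the np.select approach returns the same values as the loop-based parse_name, here is a minimal sketch on a handful of made-up names. It lowercases and slices the first letter once rather than calling .str.lower() five times; that is a variation on the code above, not what was timed, and the compact parse_name is an assumed equivalent of the original:

```python
import numpy as np
import pandas as pd

# Illustrative data, not the series used in the timings.
names = pd.Series(['adam', 'chang', 'Eliza', 'odom', 'ursula', 'ivy'])
vowels = ['a', 'e', 'i', 'o', 'u']

def parse_name(name):
    # Compact equivalent of the if/elif chain above.
    first = name[:1].lower()
    return first.upper() if first in vowels else name

# Compute the lowercased first letter once, then build one mask per vowel.
first_letter = names.str.lower().str[:1]
cases = [(first_letter == v).to_numpy(dtype=bool) for v in vowels]
replacements = [v.upper() for v in vowels]

result = np.select(cases, replacements, default=names.to_numpy(dtype=object))
expected = names.apply(parse_name).tolist()

print(list(result))  # ['A', 'chang', 'E', 'O', 'U', 'I']
```

Whether computing the first letter once is noticeably faster than the five separate .str calls would need to be timed on the real data.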