Why apply sometimes isn't faster than for-loop in pandas dataframe?
It is my understanding that .apply
is not generally faster than iteration over the axis. I believe underneath the hood it is merely a loop over the axis, except you are incurring the overhead of a function call each time in this case.
If we look at the source code, we can see that essentially we are iterating over the indicated axis and applying the function, building the individual results as series into a dictionary, and the finally calling the dataframe constructor on the dictionary returning a new DataFrame:
if axis == 0:
series_gen = (self._ixs(i, axis=1)
for i in range(len(self.columns)))
res_index = self.columns
res_columns = self.index
elif axis == 1:
res_index = self.index
res_columns = self.columns
values = self.values
series_gen = (Series.from_array(arr, index=res_columns, name=name,
dtype=dtype)
for i, (arr, name) in enumerate(zip(values,
res_index)))
else: # pragma : no cover
raise AssertionError('Axis must be 0 or 1, got %s' % str(axis))
i = None
keys = []
results = {}
if ignore_failures:
successes = []
for i, v in enumerate(series_gen):
try:
results[i] = func(v)
keys.append(v.name)
successes.append(i)
except Exception:
pass
# so will work with MultiIndex
if len(successes) < len(res_index):
res_index = res_index.take(successes)
else:
try:
for i, v in enumerate(series_gen):
results[i] = func(v)
keys.append(v.name)
except Exception as e:
if hasattr(e, 'args'):
# make sure i is defined
if i is not None:
k = res_index[i]
e.args = e.args + ('occurred at index %s' %
pprint_thing(k), )
raise
if len(results) > 0 and is_sequence(results[0]):
if not isinstance(results[0], Series):
index = res_columns
else:
index = None
result = self._constructor(data=results, index=index)
result.columns = res_index
if axis == 1:
result = result.T
result = result._convert(datetime=True, timedelta=True, copy=False)
else:
result = Series(results)
result.index = res_index
return result
Specifically:
for i, v in enumerate(series_gen):
results[i] = func(v)
keys.append(v.name)
Where series_gen
was constructed based on the requested axis.
To get more performance out of a function, you can follow the advice given here.
Essentially, your options are:
- Write a C extension
- Use
numba
(a JIT compiler) - Use
pandas.eval
to squeeze performance out of large Dataframes