Why is apply sometimes not faster than a for-loop over a pandas DataFrame?

As I understand it, .apply is generally not faster than iterating over the axis yourself. Under the hood it is essentially a loop over the axis, and each iteration additionally incurs the overhead of a function call.

If we look at the source code, we can see that it iterates over the indicated axis and applies the function, collects the individual results as Series in a dictionary, and finally calls the DataFrame constructor on that dictionary to return a new DataFrame:

    if axis == 0:
        series_gen = (self._ixs(i, axis=1)
                      for i in range(len(self.columns)))
        res_index = self.columns
        res_columns = self.index
    elif axis == 1:
        res_index = self.index
        res_columns = self.columns
        values = self.values
        series_gen = (Series.from_array(arr, index=res_columns, name=name,
                                        dtype=dtype)
                      for i, (arr, name) in enumerate(zip(values,
                                                          res_index)))
    else:  # pragma : no cover
        raise AssertionError('Axis must be 0 or 1, got %s' % str(axis))

    i = None
    keys = []
    results = {}
    if ignore_failures:
        successes = []
        for i, v in enumerate(series_gen):
            try:
                results[i] = func(v)
                keys.append(v.name)
                successes.append(i)
            except Exception:
                pass
        # so will work with MultiIndex
        if len(successes) < len(res_index):
            res_index = res_index.take(successes)
    else:
        try:
            for i, v in enumerate(series_gen):
                results[i] = func(v)
                keys.append(v.name)
        except Exception as e:
            if hasattr(e, 'args'):
                # make sure i is defined
                if i is not None:
                    k = res_index[i]
                    e.args = e.args + ('occurred at index %s' %
                                       pprint_thing(k), )
            raise

    if len(results) > 0 and is_sequence(results[0]):
        if not isinstance(results[0], Series):
            index = res_columns
        else:
            index = None

        result = self._constructor(data=results, index=index)
        result.columns = res_index

        if axis == 1:
            result = result.T
        result = result._convert(datetime=True, timedelta=True, copy=False)

    else:

        result = Series(results)
        result.index = res_index

    return result

Specifically:

    for i, v in enumerate(series_gen):
        results[i] = func(v)
        keys.append(v.name)

Where series_gen was constructed based on the requested axis.
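To make the point concrete, here is a rough sketch of what that loop amounts to for axis=1 when the function returns a scalar. The helper name apply_rows_sketch is hypothetical and not part of pandas, and it uses iterrows in place of the internal series_gen, so treat it as an illustration under those assumptions rather than the actual implementation:

```python
import pandas as pd


def apply_rows_sketch(df, func):
    """Simplified imitation of df.apply(func, axis=1) for a scalar-returning func.

    Hypothetical helper for illustration only; the real pandas code path
    handles many more cases (Series results, error reporting, etc.).
    """
    results = {}
    # One Series object is built per row, and func is called once per row
    # at Python level. This is where the time goes.
    for i, (_, row) in enumerate(df.iterrows()):
        results[i] = func(row)
    result = pd.Series(results)
    result.index = df.index
    return result


df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
looped = apply_rows_sketch(df, lambda row: row["a"] + row["b"])
applied = df.apply(lambda row: row["a"] + row["b"], axis=1)
```

Both looped and applied hold the same values here, which is the point: apply does not avoid the per-row Python call, it just wraps it.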

To get more performance out of a function, you can follow the advice given here.

Essentially, your options are:

Write a C extension
Use numba (a JIT compiler)
Use pandas.eval to squeeze performance out of large DataFrames
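As a small illustration of the last option, the sketch below computes the same column sum once with a row-wise apply and once with pandas.eval. The frame and the column names a and b are made up for the example, and any speedup from pandas.eval generally only shows up on large frames, so this demonstrates the usage rather than a benchmark:

```python
import pandas as pd


def sum_with_apply(df):
    # Row-wise apply: one Python function call per row.
    return df.apply(lambda row: row["a"] + row["b"], axis=1)


def sum_with_eval(df):
    # pandas.eval evaluates the whole expression at once; the name df is
    # passed in explicitly through local_dict.
    return pd.eval("df.a + df.b", local_dict={"df": df})


df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
via_apply = sum_with_apply(df)
via_eval = sum_with_eval(df)
```

The two results agree; the difference is that the eval version never calls back into Python for each row.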
