Pandas - Explanation on apply function being slow
Apply function seems to work very slow with a large dataframe (about 1~3 million rows).
I have checked related questions here, like Speed up Pandas apply function, and Counting within pandas apply() function, it seems the best way to speed it up is not to use apply function :)
For my case, I have two kinds of tasks to do with the apply function.
First: apply with lookup dict query
f(p_id, p_dict):
return p_dict[p_dict['ID'] == p_id]['value']
p_dict = DataFrame(...) # it's another dict works like lookup table
df = df.apply(f, args=(p_dict,))
Second: apply with groupby
f(week_id, min_week_num, p_dict):
return p_dict[(week_id - min_week_num < p_dict['WEEK']) & (p_dict['WEEK'] < week_id)].ix[:,2].mean()
f_partial = partial(f, min_week_num=min_week_num, p_dict=p_dict)
df = map(f, df['WEEK'])
I guess for the fist case, it could be done with dataframe join, while I am not sure about resource cost for such join on a large dataset.
My question is:
- Is there any way to substitute apply in the two above cases?
- Why is apply so slow? For the dict lookup case, I think it should be O(N), it shouldn't cost that much even if N is 1 million.
Concerning your first question, I can't say exactly why this instance is slow. But generally, apply
does not take advantage of vectorization. Also, apply
returns a new Series or DataFrame object, so with a very large DataFrame, you have considerable IO overhead (I cannot guarantee this is the case 100% of the time since Pandas has loads of internal implementation optimization).
For your first method, I assume you are trying to fill a 'value' column in df
using the p_dict
as a lookup table. It is about 1000x faster to use pd.merge
:
import string, sys
import numpy as np
import pandas as pd
##
# Part 1 - filling a column by a lookup table
##
def f1(col, p_dict):
return [p_dict[p_dict['ID'] == s]['value'].values[0] for s in col]
# Testing
n_size = 1000
np.random.seed(997)
p_dict = pd.DataFrame({'ID': [s for s in string.ascii_uppercase], 'value': np.random.randint(0,n_size, 26)})
df = pd.DataFrame({'p_id': [string.ascii_uppercase[i] for i in np.random.randint(0,26, n_size)]})
# Apply the f1 method as posted
%timeit -n1 -r5 temp = df.apply(f1, args=(p_dict,))
>>> 1 loops, best of 5: 832 ms per loop
# Using merge
np.random.seed(997)
df = pd.DataFrame({'p_id': [string.ascii_uppercase[i] for i in np.random.randint(0,26, n_size)]})
%timeit -n1 -r5 temp = pd.merge(df, p_dict, how='inner', left_on='p_id', right_on='ID', copy=False)
>>> 1000 loops, best of 5: 826 µs per loop
Concerning the second task, we can quickly add a new column to p_dict
that calculates a mean where the time window starts at min_week_num
and ends at the week for that row in p_dict
. This requires that p_dict
is sorted by ascending order along the WEEK
column. Then you can use pd.merge
again.
I am assuming that min_week_num
is 0 in the following example. But you could easily modify rolling_growing_mean
to take a different value. The rolling_growing_mean
method will run in O(n) since it conducts a fixed number of operations per iteration.
n_size = 1000
np.random.seed(997)
p_dict = pd.DataFrame({'WEEK': range(52), 'value': np.random.randint(0, 1000, 52)})
df = pd.DataFrame({'WEEK': np.random.randint(0, 52, n_size)})
def rolling_growing_mean(values):
out = np.empty(len(values))
out[0] = values[0]
# Time window for taking mean grows each step
for i, v in enumerate(values[1:]):
out[i+1] = np.true_divide(out[i]*(i+1) + v, i+2)
return out
p_dict['Means'] = rolling_growing_mean(p_dict['value'])
df_merged = pd.merge(df, p_dict, how='inner', left_on='WEEK', right_on='WEEK')