Pandas interpolate within a groupby
Solution 1:
>>> df.groupby('filename').apply(lambda group: group.interpolate(method='index'))
    filename  val1  val2
t
1  file1.csv   5.0  10.0
2  file1.csv  10.0  15.0
3  file1.csv  15.0  20.0
6  file2.csv   NaN   NaN
7  file2.csv  10.0  20.0
8  file2.csv  12.0  15.0
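For context, here is a minimal sketch of the kind of DataFrame this answer assumes. The column names and index values are taken from the output above; the exact placement of the NaNs in the input is my assumption:

import numpy as np
import pandas as pd

# Hypothetical reconstruction of the question's input: 't' is the index, and
# the NaN cells are the ones interpolate(method='index') fills within each group.
df = pd.DataFrame({
    'filename': ['file1.csv'] * 3 + ['file2.csv'] * 3,
    'val1': [5, np.nan, 15, np.nan, 10, 12],
    'val2': [10, np.nan, 20, np.nan, 20, 15],
}, index=pd.Index([1, 2, 3, 6, 7, 8], name='t'))

Note that the leading NaN row of file2.csv stays NaN: interpolation only fills values between observations within a group, so nothing before a group's first valid value is back-filled.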
Solution 2:
I ran into this as well. Instead of using apply, you can use transform, which will reduce your run time by more than 25% if you have on the order of 1000 groups:
import numpy as np
import pandas as pd

np.random.seed(500)

test_df = pd.DataFrame({
    'a': np.random.randint(low=0, high=1000, size=10000),
    'b': np.random.choice([1, 2, 4, 7, np.nan], size=10000, p=[0.2475] * 4 + [0.01])
})
Tests:
%timeit test_df.groupby('a').transform(pd.DataFrame.interpolate)
Output: 566 ms ± 27.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit test_df.groupby('a').apply(pd.DataFrame.interpolate)
Output: 788 ms ± 10.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit test_df.groupby('a').apply(lambda group: group.interpolate())
Output: 787 ms ± 17.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit test_df.interpolate()
Output: 918 µs ± 16.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
You will still see a significant increase in run time compared to a fully vectorized call to interpolate on the full DataFrame, but I don't think you can do much better in pandas.
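If you want to write the interpolated values back into the frame from Solution 1, a transform-based version might look like the sketch below (assuming the reconstructed df from above). transform returns only the non-grouping columns, aligned to df's index, and extra keyword arguments such as method='index' are passed through to interpolate:

# Fill the NaNs per group without modifying the original frame.
filled = df.copy()
filled[['val1', 'val2']] = df.groupby('filename')[['val1', 'val2']].transform(
    pd.DataFrame.interpolate, method='index'
)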
Solution 3:
Considering the long run times of the methods above, I suggest using a for loop with interpolate(), which is no more than a few lines of code but much faster:
# Interpolate each file's rows in place, touching only the numeric columns.
num_cols = ['val1', 'val2']
for fname in df['filename'].unique():
    mask = df['filename'] == fname
    df.loc[mask, num_cols] = df.loc[mask, num_cols].interpolate(method='index')
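As a quick sanity check (again assuming the sketched df from Solution 1, starting from a fresh copy taken before the in-place loop above has run), the loop's result can be compared by value against the apply-based version:

looped = df.copy()
num_cols = ['val1', 'val2']
for fname in looped['filename'].unique():
    mask = looped['filename'] == fname
    looped.loc[mask, num_cols] = looped.loc[mask, num_cols].interpolate(method='index')

# Compare values only: depending on pandas version, apply may prepend
# 'filename' as an extra index level, so an index-insensitive check is safest.
via_apply = df.groupby('filename').apply(lambda g: g.interpolate(method='index'))
print(np.allclose(looped[num_cols].to_numpy(),
                  via_apply[num_cols].to_numpy(), equal_nan=True))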