Is there a way in Pandas to use previous row value in dataframe.apply when previous value is also calculated in the apply?
I have the following dataframe:
Index_Date A B C D
===============================
2015-01-31 10 10 Nan 10
2015-02-01 2 3 Nan 22
2015-02-02 10 60 Nan 280
2015-02-03 10 100 Nan 250
Require:
Index_Date A B C D
===============================
2015-01-31 10 10 10 10
2015-02-01 2 3 23 22
2015-02-02 10 60 290 280
2015-02-03 10 100 3000 250
Column C
is derived for 2015-01-31
by taking value
of D
.
Then I need to use the value
of C
for 2015-01-31
and multiply by the value
of A
on 2015-02-01
and add B
.
I have attempted an apply
and a shift
using an if else
by this gives a key error.
First, create the derived value:
df.loc[0, 'C'] = df.loc[0, 'D']
Then iterate through the remaining rows and fill the calculated values:
for i in range(1, len(df)):
df.loc[i, 'C'] = df.loc[i-1, 'C'] * df.loc[i, 'A'] + df.loc[i, 'B']
Index_Date A B C D
0 2015-01-31 10 10 10 10
1 2015-02-01 2 3 23 22
2 2015-02-02 10 60 290 280
Given a column of numbers:
lst = []
cols = ['A']
for a in range(100, 105):
lst.append([a])
df = pd.DataFrame(lst, columns=cols, index=range(5))
df
A
0 100
1 101
2 102
3 103
4 104
You can reference the previous row with shift:
df['Change'] = df.A - df.A.shift(1)
df
A Change
0 100 NaN
1 101 1.0
2 102 1.0
3 103 1.0
4 104 1.0
numba
For recursive calculations which are not vectorisable, numba
, which uses JIT-compilation and works with lower level objects, often yields large performance improvements. You need only define a regular for
loop and use the decorator @njit
or (for older versions) @jit(nopython=True)
:
For a reasonable size dataframe, this gives a ~30x performance improvement versus a regular for
loop:
from numba import jit
@jit(nopython=True)
def calculator_nb(a, b, d):
res = np.empty(d.shape)
res[0] = d[0]
for i in range(1, res.shape[0]):
res[i] = res[i-1] * a[i] + b[i]
return res
df['C'] = calculator_nb(*df[list('ABD')].values.T)
n = 10**5
df = pd.concat([df]*n, ignore_index=True)
# benchmarking on Python 3.6.0, Pandas 0.19.2, NumPy 1.11.3, Numba 0.30.1
# calculator() is same as calculator_nb() but without @jit decorator
%timeit calculator_nb(*df[list('ABD')].values.T) # 14.1 ms per loop
%timeit calculator(*df[list('ABD')].values.T) # 444 ms per loop
Applying the recursive function on numpy arrays will be faster than the current answer.
df = pd.DataFrame(np.repeat(np.arange(2, 6),3).reshape(4,3), columns=['A', 'B', 'D'])
new = [df.D.values[0]]
for i in range(1, len(df.index)):
new.append(new[i-1]*df.A.values[i]+df.B.values[i])
df['C'] = new
Output
A B D C
0 1 1 1 1
1 2 2 2 4
2 3 3 3 15
3 4 4 4 64
4 5 5 5 325
Although it has been a while since this question was asked, I will post my answer hoping it helps somebody.
Disclaimer: I know this solution is not standard, but I think it works well.
import pandas as pd
import numpy as np
data = np.array([[10, 2, 10, 10],
[10, 3, 60, 100],
[np.nan] * 4,
[10, 22, 280, 250]]).T
idx = pd.date_range('20150131', end='20150203')
df = pd.DataFrame(data=data, columns=list('ABCD'), index=idx)
df
A B C D
=================================
2015-01-31 10 10 NaN 10
2015-02-01 2 3 NaN 22
2015-02-02 10 60 NaN 280
2015-02-03 10 100 NaN 250
def calculate(mul, add):
global value
value = value * mul + add
return value
value = df.loc['2015-01-31', 'D']
df.loc['2015-01-31', 'C'] = value
df.loc['2015-02-01':, 'C'] = df.loc['2015-02-01':].apply(lambda row: calculate(*row[['A', 'B']]), axis=1)
df
A B C D
=================================
2015-01-31 10 10 10 10
2015-02-01 2 3 23 22
2015-02-02 10 60 290 280
2015-02-03 10 100 3000 250
So basically we use a apply
from pandas and the help of a global variable that keeps track of the previous calculated value.
Time comparison with a for
loop:
data = np.random.random(size=(1000, 4))
idx = pd.date_range('20150131', end='20171026')
df = pd.DataFrame(data=data, columns=list('ABCD'), index=idx)
df.C = np.nan
df.loc['2015-01-31', 'C'] = df.loc['2015-01-31', 'D']
%%timeit
for i in df.loc['2015-02-01':].index.date:
df.loc[i, 'C'] = df.loc[(i - pd.DateOffset(days=1)).date(), 'C'] * df.loc[i, 'A'] + df.loc[i, 'B']
3.2 s ± 114 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
data = np.random.random(size=(1000, 4))
idx = pd.date_range('20150131', end='20171026')
df = pd.DataFrame(data=data, columns=list('ABCD'), index=idx)
df.C = np.nan
def calculate(mul, add):
global value
value = value * mul + add
return value
value = df.loc['2015-01-31', 'D']
df.loc['2015-01-31', 'C'] = value
%%timeit
df.loc['2015-02-01':, 'C'] = df.loc['2015-02-01':].apply(lambda row: calculate(*row[['A', 'B']]), axis=1)
1.82 s ± 64.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So 0.57 times faster on average.