Get value from previous column for each group in groupby
This is my df
-
Site | Product | Period | Inflow | Outflow | Production | Opening Inventory | New Opening Inventory | Closing Inventory | Production Needed |
---|---|---|---|---|---|---|---|---|---|
California | Apples | 1 | 0 | 3226 | 4300 | 1213 | 1213 | 0 | 0 |
California | Apples | 2 | 0 | 3279 | 3876 | 0 | 0 | 0 | 0 |
California | Apples | 3 | 0 | 4390 | 4530 | 0 | 0 | 0 | 0 |
California | Apples | 4 | 0 | 4281 | 3870 | 0 | 0 | 0 | 0 |
California | Apples | 5 | 0 | 4421 | 4393 | 0 | 0 | 0 | 0 |
California | Oranges | 1 | 0 | 505 | 400 | 0 | 0 | 0 | 0 |
California | Oranges | 2 | 0 | 278 | 505 | 0 | 0 | 0 | 0 |
California | Oranges | 3 | 0 | 167 | 278 | 0 | 0 | 0 | 0 |
California | Oranges | 4 | 0 | 124 | 167 | 0 | 0 | 0 | 0 |
California | Oranges | 5 | 0 | 106 | 124 | 0 | 0 | 0 | 0 |
Montreal | Maple Syrup | 1 | 0 | 445 | 465 | 293 | 293 | 0 | 0 |
Montreal | Maple Syrup | 2 | 0 | 82 | 398 | 0 | 0 | 0 | 0 |
Montreal | Maple Syrup | 3 | 0 | 745 | 346 | 0 | 0 | 0 | 0 |
Montreal | Maple Syrup | 4 | 0 | 241 | 363 | 0 | 0 | 0 | 0 |
Montreal | Maple Syrup | 5 | 0 | 189 | 254 | 0 | 0 | 0 | 0 |
As seen, there are three groups when grouped by Site
and Product
. For each of the three groups I want to do the following (for periods 2 to 5) -
- Set
New Opening Inventory
toClosing Inventory
of previous period - Calculate
Closing Inventory
for next period using the formula,Closing Inventory
=Production
+Inflow
+New Opening Inventory
-Outflow
I am trying to solve this using a combination of groupby
and for loop
Here is what I have so far -
If df
is a single group I can simply do
# calculate closing inventory of period 1
df['Closing Inventory'] = np.where(df['PeriodNo']==1, <formula>, 0)
for i in range(1, len(df)):
df.loc[i, 'New Opening Inventory'] = df.loc[i-1, 'Closing Inventory']
df.loc[i, 'Closing Inventory'] = df.loc[i, 'Production'] + df.loc[i, 'Inflow'] + df.loc[i, 'New Opening Inventory'] - df.loc[i, 'Outflow']
and when I try to nest this for loop
in a loop over groups
# calculate closing inventory of period 1 for all groups
df['Closing Inventory'] = np.where(df['PeriodNo']==1, <formula>, 0)
g = df.groupby(['Site', 'Product']
alist = []
for k in g.groups.keys():
temp = g.get_group(k).reset_index(drop=True)
for i in range(1, len(temp)):
temp.loc[i, 'New Opening Inventory'] = temp.loc[i-1, 'Closing Inventory']
temp.loc[i, 'Closing Inventory'] = temp.loc[i, 'Production'] + temp.loc[i, 'Inflow'] + temp.loc[i, 'New Opening Inventory'] - temp.loc[i, 'Outflow']
alist.append(temp)
df2 = pd.concat(alist, ignore_index=True)
df2
This solution works but seems very inefficient with the nested loop. Is there a better way to do this?
Solution 1:
Your New Opening Inventory is always previous Closing Inventory.
So I can modify this formula
Closing Inventory = Production + Inflow + New Opening Inventory - Outflow
to
Closing Inventory = Production + Inflow + Previous Closing Inventory - Outflow
For the first row, you do not have the Closing Inventory. But from the 2nd row, you calculate the Closing Inventory and you carry over the Closing Inventory to the next row.
Before obtaining the Closing Inventory, at first, calculate "Production" + "Inflow" - "Overflow" using list comprehension. The list comprehension performs better than for loop.
df['Closing Inventory'] = [x + y - z if p > 1 else 0 for p, x, y, z in zip(df['Period'], df['Production'], df['Inflow'], df['Outflow'])]
# df[['Site', 'Product', 'Closing Inventory']]
# Site Product Closing Inventory
# 0 California Apples 0
# 1 California Apples 597
# 2 California Apples 140
# 3 California Apples -411
# 4 California Apples -28
# 5 California Oranges 0
# 6 California Oranges 227
# 7 California Oranges 111
# ...
Then, the rest of formula is adding the previously calculated Closing Inventory, that means you can cumsum
this result.
For row 1, Previous Closing (0) + calculated part (597) = 597
For row 2, Previous Closing (597) + calculated part (140) = 737
...
df['Closing Inventory'] = df.groupby(['Site', 'Product'])['Closing Inventory'].cumsum()
# df[['Site', 'Product', 'Closing Inventory']]
# Site Product Closing_Inventory
# 0 California Apples 0
# 1 California Apples 597
# 2 California Apples 737
# 3 California Apples 326
# 4 California Apples 298
# 5 California Oranges 0
# 6 California Oranges 227
# 7 California Oranges 338
# ...
Again the New Opening Inventory is always previous Closing Inventory except when period is 1. Hence first, shift the Closing Inventory then pick New Opening Inventory when period is 1.
I used combine_first
to pick value from either New Opening or Closing Inventory.
df['New Opening Inventory'] = (df['New Opening Inventory'].replace(0, np.nan)
.combine_first(
df.groupby(['Site', 'Product'])['Closing Inventory']
.shift()
.fillna(0)
).astype(int))
Result
Site Product Period New Opening Inventory Closing Inventory
0 California Apples 1 1213 0
1 California Apples 2 0 597
2 California Apples 3 597 737
3 California Apples 4 737 326
4 California Apples 5 326 298
5 California Oranges 1 0 0
6 California Oranges 2 0 227
7 California Oranges 3 227 338
...
With the sample data on my laptop,
Original solution: 8.44 ms ± 280 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This solution: 2.95 ms ± 23.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I think with list comprehension and vectorization function, this solution can perform faster.