Get value from previous column for each group in groupby

This is my df -

Site Product Period Inflow Outflow Production Opening Inventory New Opening Inventory Closing Inventory Production Needed
California Apples 1 0 3226 4300 1213 1213 0 0
California Apples 2 0 3279 3876 0 0 0 0
California Apples 3 0 4390 4530 0 0 0 0
California Apples 4 0 4281 3870 0 0 0 0
California Apples 5 0 4421 4393 0 0 0 0
California Oranges 1 0 505 400 0 0 0 0
California Oranges 2 0 278 505 0 0 0 0
California Oranges 3 0 167 278 0 0 0 0
California Oranges 4 0 124 167 0 0 0 0
California Oranges 5 0 106 124 0 0 0 0
Montreal Maple Syrup 1 0 445 465 293 293 0 0
Montreal Maple Syrup 2 0 82 398 0 0 0 0
Montreal Maple Syrup 3 0 745 346 0 0 0 0
Montreal Maple Syrup 4 0 241 363 0 0 0 0
Montreal Maple Syrup 5 0 189 254 0 0 0 0

As seen, there are three groups when grouped by Site and Product. For each of the three groups I want to do the following (for periods 2 to 5) -

  • Set New Opening Inventory to Closing Inventory of previous period
  • Calculate Closing Inventory for next period using the formula, Closing Inventory = Production + Inflow + New Opening Inventory - Outflow

I am trying to solve this using a combination of groupby and for loop

Here is what I have so far -

If df is a single group I can simply do

# calculate closing inventory of period 1
df['Closing Inventory'] = np.where(df['PeriodNo']==1, <formula>, 0)

for i in range(1, len(df)):
    df.loc[i, 'New Opening Inventory'] = df.loc[i-1, 'Closing Inventory']
    df.loc[i, 'Closing Inventory'] = df.loc[i, 'Production'] + df.loc[i, 'Inflow'] + df.loc[i, 'New Opening Inventory'] - df.loc[i, 'Outflow']

and when I try to nest this for loop in a loop over groups

# calculate closing inventory of period 1 for all groups
df['Closing Inventory'] = np.where(df['PeriodNo']==1, <formula>, 0)

g = df.groupby(['Site', 'Product']

alist = []

for k in g.groups.keys():
    temp = g.get_group(k).reset_index(drop=True)
    for i in range(1, len(temp)):
        temp.loc[i, 'New Opening Inventory'] = temp.loc[i-1, 'Closing Inventory']
        temp.loc[i, 'Closing Inventory'] = temp.loc[i, 'Production'] + temp.loc[i, 'Inflow'] + temp.loc[i, 'New Opening Inventory'] - temp.loc[i, 'Outflow']
    alist.append(temp)

df2 = pd.concat(alist, ignore_index=True)
df2

This solution works but seems very inefficient with the nested loop. Is there a better way to do this?


Solution 1:

Your New Opening Inventory is always previous Closing Inventory.

So I can modify this formula

Closing Inventory = Production + Inflow + New Opening Inventory - Outflow

to

Closing Inventory = Production + Inflow + Previous Closing Inventory - Outflow

For the first row, you do not have the Closing Inventory. But from the 2nd row, you calculate the Closing Inventory and you carry over the Closing Inventory to the next row.

Before obtaining the Closing Inventory, at first, calculate "Production" + "Inflow" - "Overflow" using list comprehension. The list comprehension performs better than for loop.

df['Closing Inventory'] = [x + y - z if p > 1 else 0 for p, x, y, z in zip(df['Period'], df['Production'], df['Inflow'], df['Outflow'])]

# df[['Site', 'Product', 'Closing Inventory']]
#         Site  Product Closing Inventory
# 0 California  Apples                  0
# 1 California  Apples                597
# 2 California  Apples                140
# 3 California  Apples               -411
# 4 California  Apples                -28
# 5 California  Oranges                 0
# 6 California  Oranges               227
# 7 California  Oranges               111
# ...

Then, the rest of formula is adding the previously calculated Closing Inventory, that means you can cumsum this result.

For row 1, Previous Closing (0) + calculated part (597) = 597
For row 2, Previous Closing (597) + calculated part (140) = 737
...
df['Closing Inventory'] = df.groupby(['Site', 'Product'])['Closing Inventory'].cumsum()

# df[['Site', 'Product', 'Closing Inventory']]
#         Site  Product Closing_Inventory
# 0 California  Apples                  0
# 1 California  Apples                597
# 2 California  Apples                737
# 3 California  Apples                326
# 4 California  Apples                298
# 5 California  Oranges                 0
# 6 California  Oranges               227
# 7 California  Oranges               338
# ...

Again the New Opening Inventory is always previous Closing Inventory except when period is 1. Hence first, shift the Closing Inventory then pick New Opening Inventory when period is 1.

I used combine_first to pick value from either New Opening or Closing Inventory.

df['New Opening Inventory'] = (df['New Opening Inventory'].replace(0, np.nan)
                               .combine_first(
                                   df.groupby(['Site', 'Product'])['Closing Inventory']
                                   .shift()
                                   .fillna(0)
                               ).astype(int))

Result

          Site  Product Period  New Opening Inventory Closing Inventory
0   California  Apples       1                   1213                 0
1   California  Apples       2                      0               597
2   California  Apples       3                    597               737
3   California  Apples       4                    737               326
4   California  Apples       5                    326               298
5   California  Oranges      1                      0                 0
6   California  Oranges      2                      0               227
7   California  Oranges      3                    227               338
...

With the sample data on my laptop,

Original solution: 8.44 ms ± 280 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This solution: 2.95 ms ± 23.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

I think with list comprehension and vectorization function, this solution can perform faster.