Pandas Groupby: get value from previous element of a group based on value of another column

I have a data frame with 4 columns. I have sorted this data frame by 'group' and 'timestamp' beforehand.

df = pd.DataFrame(
{
    "type": ['type0', 'type1', 'type2', 'type3', 'type1', 'type3', 'type0', 'type1', 'type3', 'type3'],
    "group": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
    "timestamp": ["20220105 07:52:46", "20220105 07:53:11", "20220105 07:53:55", "20220105 07:59:12", "20220105 08:24:13", "20220105 08:48:19", "20220105 11:01:30", "20220105 11:15:16", "20220105 12:13:36", "20220105 12:19:44"],
    "price": [0, 1.5, 2.5, 3, 3.2, 3.1, 0.5, 3, 3.25, pd.NA]
})

>> df
    type  group          timestamp price
0  type0      1  20220105 07:52:46     0
1  type1      1  20220105 07:53:11   1.5
2  type2      1  20220105 07:53:55   2.5
3  type3      1  20220105 07:59:12     3
4  type1      1  20220105 08:24:13   3.2
5  type3      1  20220105 08:48:19   3.1
6  type0      2  20220105 11:01:30   0.5
7  type1      2  20220105 11:15:16     3
8  type3      2  20220105 12:13:36  3.25
9  type3      2  20220105 12:19:44  <NA>

After grouping by the column 'group', I want to create a 'new_price' column as per the following logic:
For each 'type3' row in a group (i.e., df['type'] = 'type3'), get the price from the PREVIOUS 'type1' or 'type2' row in the group. For type0/type1/type2 rows, keep the same price as in the input data frame.

My Solution:

My solution below works when we don't have 2 consecutive 'type3' rows. But when there are 2 consecutive 'type3' rows, I get the wrong price for the second 'type3' row. I want the price from the previous 'type1' or 'type2' row in the group, but I get the price from the first 'type3' row using my solution.

df = df.sort_values(by=["group", "timestamp"])
required_types_mask = df['type'].isin(['type1', 'type2', 'type3'])
temp_series = df.loc[:, 'price'].where(required_types_mask).groupby(df['group']).shift(1)
type_3_mask = df['type'].eq('type3')
df.loc[:, 'new_price'] = df.loc[:, 'price'].mask(type_3_mask, temp_series)

My result:

    type  group          timestamp price new_price
0  type0      1  20220105 07:52:46     0         0
1  type1      1  20220105 07:53:11   1.5       1.5
2  type2      1  20220105 07:53:55   2.5       2.5
3  type3      1  20220105 07:59:12     3       2.5
4  type1      1  20220105 08:24:13   3.2       3.2
5  type3      1  20220105 08:48:19   3.1       3.2
6  type0      2  20220105 11:01:30   0.5       0.5
7  type1      2  20220105 11:15:16     3         3
8  type3      2  20220105 12:13:36  3.25         3
9  type3      2  20220105 12:19:44  <NA>       3.25 <- Incorrect price

Expected result:

    type  group          timestamp price new_price
0  type0      1  20220105 07:52:46     0         0
1  type1      1  20220105 07:53:11   1.5       1.5
2  type2      1  20220105 07:53:55   2.5       2.5
3  type3      1  20220105 07:59:12     3       2.5
4  type1      1  20220105 08:24:13   3.2       3.2
5  type3      1  20220105 08:48:19   3.1       3.2
6  type0      2  20220105 11:01:30   0.5       0.5
7  type1      2  20220105 11:15:16     3         3
8  type3      2  20220105 12:13:36  3.25         3
9  type3      2  20220105 12:19:44  <NA>         3 <- Correct price

Solution 1:

We can mask the price with type3 then ffill

s = df.price.mask(df.type.isin(['type0','type3']))
df['new'] = np.where(df.type.eq('type3'),s.groupby(df['group']).ffill(),df['price'])
df
    type  group          timestamp price  new
0  type0      1  20220105 07:52:46     0    0
1  type1      1  20220105 07:53:11   1.5  1.5
2  type2      1  20220105 07:53:55   2.5  2.5
3  type3      1  20220105 07:59:12     3  2.5
4  type1      1  20220105 08:24:13   3.2  3.2
5  type3      1  20220105 08:48:19   3.1  3.2
6  type0      2  20220105 11:01:30   0.5  0.5
7  type1      2  20220105 11:15:16     3    3
8  type3      2  20220105 12:13:36  3.25    3
9  type3      2  20220105 12:19:44  <NA>    3

Solution 2:

You can use a series of masks to ffill.

First mask 'type3' and 'type0' (the latter to avoid using it as source to ffill). Then restore the values of 'type0'.

All is done per group.

df['new_price'] = (
 df.groupby('group')
   .apply(lambda d: d['price']
            .mask(d['type'].isin(['type3', 'type0'])) # type0/3 to NaN
            .ffill()                                  # fill with previous type1/2
            .mask(d['type'].eq('type0'), d['price'])  # restore type0
         )
   .values
 )

output:

    type  group          timestamp price new_price
0  type0      1  20220105 07:52:46     0         0
1  type1      1  20220105 07:53:11   1.5       1.5
2  type2      1  20220105 07:53:55   2.5       2.5
3  type3      1  20220105 07:59:12     3       2.5
4  type1      1  20220105 08:24:13   3.2       3.2
5  type3      1  20220105 08:48:19   3.1       3.2
6  type0      2  20220105 11:01:30   0.5       0.5
7  type1      2  20220105 11:15:16     3       3.0
8  type3      2  20220105 12:13:36  3.25       3.0
9  type3      2  20220105 12:19:44  <NA>       3.0

Pandas Groupby: get value from previous element of a group based on value of another column

Solution 1:

Solution 2:

Related

Recent Posts