Pandas collapse multiple rows into one by selecting the most probable

Problem

My first DataFrame consists of the following:

df1 = pd.DataFrame(
    [[1, 'C_A'], [2, 'C_B'], [3, 'C_C'], [3, 'C_D'], [4, 'C_F']],
    columns=['time', 'category']
)

   time category
0     1      C_A
1     2      C_B
2     3      C_C
3     3      C_D
4     4      C_F

There are some rows that have multiple time entries.

Now my second df is a probability distribution at each time for each category:

df2 = pd.DataFrame(
    [[1., 0., 0., 0., 0.], [0., 1., 0., 0., 0.],
     [0., 0., 0.7, 0.3, 0.], [1, 0., 0., 0., 0.]],
    columns=['C_A', 'C_B', 'C_C', 'C_D', 'C_F']
)

   C_A  C_B  C_C  C_D  C_F
0  1.0  0.0  0.0  0.0  0.0
1  0.0  1.0  0.0  0.0  0.0
2  0.0  0.0  0.7  0.3  0.0
3  1.0  0.0  0.0  0.0  0.0

For the rows with one single time, I want to keep them as they are.

But, for example, for the 3rd and 4th row of df1 I'd like to keep only 1 of the multiple rows, to be the one with the highest probability.

The final result would be:

pd.DataFrame(
    [[1, 'C_A'], [2, 'C_B'], [3, 'C_C'], [4, 'C_F']],
    columns=['time', 'category']
)

   time category
0     1      C_A
1     2      C_B
2     3      C_C
3     4      C_F

Question

How can I collapse these multiple rows when they do appear, and keep the row with the greatest probability based on my other DataFrame?


Solution 1:

You want to extract the value (probability) from df2 using df1.time as index and df2.category as columns. Then, you can sort the values and drop duplicates.

The first step is basically a case of the deprecated lookup function. You can use numpy indexing instead:

row_idx, col_idx = df2.index.get_indexer(df1.time-1), df2.columns.get_indexer(df1.category)

(df1.assign(prob=df2.to_numpy()[row_idx, col_idx])
    .sort_values(['time', 'prob'])
    .drop_duplicates('time', keep='last')
)

Output:

   time category  prob
0     1      C_A   1.0
1     2      C_B   1.0
2     3      C_C   0.7
4     4      C_F   0.0