Pandas collapse multiple rows into one by selecting the most probable
Problem
My first DataFrame consists of the following:
df1 = pd.DataFrame(
[[1, 'C_A'], [2, 'C_B'], [3, 'C_C'], [3, 'C_D'], [4, 'C_F']],
columns=['time', 'category']
)
time category
0 1 C_A
1 2 C_B
2 3 C_C
3 3 C_D
4 4 C_F
There are some rows that have multiple time entries.
Now my second df is a probability distribution at each time for each category:
df2 = pd.DataFrame(
[[1., 0., 0., 0., 0.], [0., 1., 0., 0., 0.],
[0., 0., 0.7, 0.3, 0.], [1, 0., 0., 0., 0.]],
columns=['C_A', 'C_B', 'C_C', 'C_D', 'C_F']
)
C_A C_B C_C C_D C_F
0 1.0 0.0 0.0 0.0 0.0
1 0.0 1.0 0.0 0.0 0.0
2 0.0 0.0 0.7 0.3 0.0
3 1.0 0.0 0.0 0.0 0.0
For the rows with one single time, I want to keep them as they are.
But, for example, for the 3rd and 4th row of df1
I'd like to keep only 1 of the multiple rows, to be the one with
the highest probability.
The final result would be:
pd.DataFrame(
[[1, 'C_A'], [2, 'C_B'], [3, 'C_C'], [4, 'C_F']],
columns=['time', 'category']
)
time category
0 1 C_A
1 2 C_B
2 3 C_C
3 4 C_F
Question
How can I collapse these multiple rows when they do appear, and keep the row with the greatest probability based on my other DataFrame?
Solution 1:
You want to extract the value (probability) from df2
using df1.time
as index and df2.category
as columns. Then, you can sort the values and drop duplicates.
The first step is basically a case of the deprecated lookup
function. You can use numpy indexing instead:
row_idx, col_idx = df2.index.get_indexer(df1.time-1), df2.columns.get_indexer(df1.category)
(df1.assign(prob=df2.to_numpy()[row_idx, col_idx])
.sort_values(['time', 'prob'])
.drop_duplicates('time', keep='last')
)
Output:
time category prob
0 1 C_A 1.0
1 2 C_B 1.0
2 3 C_C 0.7
4 4 C_F 0.0