How to minimize parameter in row pandas dataframe
I have dataframe with bus stop arrival forecast:
path_id | forecast | forecast_made_at | bus_id
int | datetime | datetime | int
We make predictions every 5 minutes, so database entries can be duplicated. for example
In 11:50 we predict bus #11544 will arrive at 11:59
In 11:50 we predict bus #95447 will arrive at 11:55
--......--
In 11:55 we predict bus #11544 will arrive at 12:02
I want to get newest prediction with biggest forecast_made_at parameter:
res = pd.DataFrame()
for k, row in t_data.iterrows():
prediction = dict(**row)
forecasts = t_data[t_data["bus_id"] == prediction["bus_id"]] # Forecasts with the same bus_id
prediction["best"] = (prediction["forecast_made_at"] == max(forecasts["forecast_made_at"]))
res = res.append(prediction, ignore_index=True)
res = res[res["best"] == True]
In this code, we are working with dictionaries and not with pandas objects, so this one is very slow. How can I do this using pandas tools
Solution 1:
What you need is a combination of grouping by bus_id
, sorting by date and selection of most recent row.
One option – dropping duplicates by bus_id
and only keeping most recent record:
t_data.sort_values('forecast_made_at').drop_duplicates(subset=['bus_id'], keep='last')
Another option: Grouping by bus_id
and selecting last record:
t_data.sort_values('forecast_made_at').groupby('bus_id').last().reset_index()
Solution 2:
Using this dataframe as an example
path_id forecast forecast_made_at bus_id
0 1 2018-01-01 14:10:00 2018-01-01 11:10:00 7
1 1 2018-01-01 14:10:00 2018-01-01 10:15:00 7
2 1 2018-01-01 14:10:00 2018-01-01 10:49:00 7
3 2 2018-09-10 03:05:00 2018-09-09 23:05:00 6
4 2 2018-09-10 03:05:00 2018-09-10 03:00:00 6
5 2 2018-09-10 03:05:00 2018-09-10 01:30:00 6
6 3 2018-04-21 17:32:00 2018-04-21 17:31:00 4
7 3 2018-04-21 17:32:00 2018-04-21 17:12:00 4
8 3 2018-04-21 17:32:00 2018-04-21 17:02:00 4
You can achieve that with the following
new_df = df.loc[df.groupby('forecast')['forecast_made_at'].idxmax()]
print(new_df)
path_id forecast forecast_made_at bus_id
0 1 2018-01-01 14:10:00 2018-01-01 11:10:00 7
6 3 2018-04-21 17:32:00 2018-04-21 17:31:00 4
4 2 2018-09-10 03:05:00 2018-09-10 03:00:00 6