Categorize a datetime interval based on other datetime intervals and put the values in columns
I'm dealing with a hard challenge and I don't know how to solve it.
I have a dataframe like this:
Product_Name Start_Time End_Time
Product X 2021-10-20 20:32:00 2021-10-21 03:50:00
Product Y 2021-10-21 11:50:00 2021-10-21 16:00:00
Product Z 2022-01-11 20:10:00 2022-01-12 15:30:00
I also have three time ranges, each with its own category:
A: 05:01 to 14:00
B: 14:01 to 22:00
C: 22:01 to 05:00
What I want to do is calculate how many decimal hours fall into each category (A, B and C) between "Start_Time" and "End_Time", to get something like this:
Product_Name Start_Time End_Time A B C
Product X 2021-10-20 20:30:00 2021-10-21 03:50:00 0.00 1.50 5.82
Product Y 2021-10-21 11:50:00 2021-10-21 16:00:00 2.17 1.98 0.00
Product Z 2022-01-11 20:10:00 2022-01-12 15:30:00 8.98 3.31 6.98
Could you help me figure out how to do it?
I'm a real beginner in Python, pandas, etc., and when I first wrote this post I really had no idea how to even start coding it. Since then I've thought about it and come up with this code. I'm sure it's not right, but I think it's a start:
start_a = 05:01:00
end_a = 14:00:00
start_b = 14:01:00
end_b = 22:00:00
start_c = 22:01:00
end_c = 05:00:00
if df['Start_Time'] > start_a and df['End_Time'] < end_a:
    df['A'] = ( df['End_Time'] - start_a ) - ( end_a - df['Start_Time'] )
else:
    df['A'] = 0
if df['Start_Time'] > start_b and df['End_Time'] < end_b:
    df['B'] = ( df['End_Time'] - start_b ) - ( end_b - df['Start_Time'] )
else:
    df['B'] = 0
if df['Start_Time'] > start_c and df['End_Time'] < end_c:
    df['C'] = ( df['End_Time'] - start_c ) - ( end_c - df['Start_Time'] )
else:
    df['C'] = 0
Your problem is a lot harder than I thought. One thing to notice is that Start_Time and End_Time can fall on different dates. Furthermore, category C crosses midnight, so it covers parts of two days. Both of these facts make the code a little complicated, but it seems to work.
First, the setup for your problem. I created your data frame and the range variables. It is important that these structures have the correct data types.
import pandas as pd
from io import StringIO
from datetime import datetime, time, date, timedelta
# Create your data frame
# Fields are comma-separated because the values themselves contain single spaces
data = StringIO("""Product_Name,Start_Time,End_Time
Product X,2021-10-20 20:32:00,2021-10-21 03:50:00
Product Y,2021-10-21 11:50:00,2021-10-21 16:00:00
Product Z,2022-01-11 20:10:00,2022-01-12 15:30:00""")
df = pd.read_csv(data)
# Convert the columns to date time format
df[["Start_Time", "End_Time"]] = df[["Start_Time", "End_Time"]].apply(pd.to_datetime)
# Create the range start and end time as datetime format
start_a = datetime.strptime('05:01:00', '%H:%M:%S')
end_a = datetime.strptime('14:00:00', '%H:%M:%S')
start_b = datetime.strptime('14:01:00', '%H:%M:%S')
end_b = datetime.strptime('22:00:00', '%H:%M:%S')
start_c = datetime.strptime('22:01:00', '%H:%M:%S')
end_c = datetime.strptime('05:00:00', '%H:%M:%S')
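One detail about the setup above: when the format string contains only a time, `datetime.strptime` fills in a default date of 1900-01-01. As far as I can tell, only the time-of-day part matters for this problem, so the default date is harmless. A quick check (the variable name `parsed` is just for illustration) shows what is actually stored:

```python
from datetime import datetime, time

parsed = datetime.strptime('05:01:00', '%H:%M:%S')
# strptime supplies 1900-01-01 as the date when the format has no date fields
print(parsed)
# Only the time-of-day part is meaningful for the range boundaries
print(parsed.time() == time(5, 1))
```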
Then, I created a function that calculates the hours for your problem. start and end are the times that define one range. The function iterates over the days and adds up how much of the range falls inside your interval on each one. Usually only one day contributes, but your Product Z goes over two days, so two days contribute.
def calc_hours(start_time, end_time, start, end):
    # Give the range a date => allows us to compare it to start_time and end_time.
    # Begin one day early so that an overnight range (C) that started the previous
    # evening is still counted when start_time falls after midnight.
    first_day = start_time.date() - timedelta(days=1)
    range_start = datetime.combine(first_day, start.time())
    range_end = datetime.combine(first_day, end.time())
    # Special case for range C as end of range is on the next day
    if range_end < range_start:
        range_end = range_end + timedelta(days=1)
    # start_time and end_time can go over one or more days => iterate over the days and sum the hours in the range
    total_hours = 0.0
    while range_start < end_time:
        # Overlap in hours, or zero if the range is not within the time frame
        hours_in_frame = max((min(range_end, end_time) - max(range_start, start_time)).total_seconds(), 0) / 3600
        total_hours += hours_in_frame
        # Move the range to the next day to check whether it is in the time frame
        range_start = range_start + timedelta(days=1)
        range_end = range_end + timedelta(days=1)
    return total_hours
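The core of the function is the overlap of two intervals: take the later of the two starts and the earlier of the two ends, and clamp the difference at zero when they do not intersect. As a minimal standalone sketch (the helper name `overlap_hours` is mine and is not part of the code above), checked against Product Y and range A:

```python
from datetime import datetime

def overlap_hours(a_start, a_end, b_start, b_end):
    """Hours shared by [a_start, a_end] and [b_start, b_end]; 0.0 if disjoint."""
    seconds = (min(a_end, b_end) - max(a_start, b_start)).total_seconds()
    return max(seconds, 0) / 3600

# Product Y (11:50 to 16:00) against range A (05:01 to 14:00) on the same day
hours = overlap_hours(
    datetime(2021, 10, 21, 11, 50), datetime(2021, 10, 21, 16, 0),
    datetime(2021, 10, 21, 5, 1), datetime(2021, 10, 21, 14, 0),
)
print(round(hours, 2))  # 2.17
```

Only 11:50 to 14:00 is shared, which is 2 hours 10 minutes.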
To use the function and add the results to the dataframe, I used apply() from pandas. With axis=1, apply() takes each row of your dataframe and calculates the hours within a range using the function shown above. This is done for all three ranges.
# Use apply to calculate the hours for each row and each range
df['A'] = df.apply(lambda x: calc_hours(x['Start_Time'], x['End_Time'], start_a, end_a), axis=1)
df['B'] = df.apply(lambda x: calc_hours(x['Start_Time'], x['End_Time'], start_b, end_b), axis=1)
df['C'] = df.apply(lambda x: calc_hours(x['Start_Time'], x['End_Time'], start_c, end_c), axis=1)
Output is almost what you wanted, but not rounded to two decimal places:
Product_Name Start_Time End_Time A B C
0 Product X 2021-10-20 20:32:00 2021-10-21 03:50:00 0.000000 1.466667 5.816667
1 Product Y 2021-10-21 11:50:00 2021-10-21 16:00:00 2.166667 1.983333 0.000000
2 Product Z 2022-01-11 20:10:00 2022-01-12 15:30:00 8.983333 3.316667 6.983333
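If you need two decimal places, rounding the three columns afterwards should be enough. A small sketch, assuming the columns are named A, B and C as above (the frame here is rebuilt by hand from the first two output rows only so that the snippet runs on its own):

```python
import pandas as pd

result = pd.DataFrame({
    "Product_Name": ["Product X", "Product Y"],
    "A": [0.000000, 2.166667],
    "B": [1.466667, 1.983333],
    "C": [5.816667, 0.000000],
})
# Round only the hour columns, leaving the other columns untouched
result[["A", "B", "C"]] = result[["A", "B", "C"]].round(2)
print(result)
```

On your real dataframe the same line would be applied to df instead of result.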