split one row into multiple record python
I have an input dataframe as follows
Class Duration StudentID Age Startdate Start Time Enddate End Time TimeDifference
5th XX 20002 5 04/12/2021 17:00:00 04/14/2021 20:00:00 3000
And I would like to split the same into three different rows based on the start and end date as follows.
Class Duration StudentID Age Startdate Start Time Enddate End Time TimeDifference
5th XX 20002 5 04/12/2021 17:00:00 04/12/2021 23:59:59 360
5th XX 20002 5 04/13/2021 0:00:00 04/13/2021 23:59:59 1440
5th XX 20002 5 04/14/2021 0:00:00 04/14/2021 20:00:00 1200
I am trying with python. Please help.
Input Output is here
I get a slightly different value for 'Time Difference', but this is an approach you can tweak and and use.
Step 1:
You can start by using melt()
with your id_vars
being all your columns except your 'Startdate' and 'Enddate'.
Step 2:
Then you can set your index
to be your StartEndDate column, created after melting your dataframe.
Step 3:
Then using reindex()
you can add the new row with your missing dates.
Lastly what's left is to calculate the time difference column and rearrange your dataframe to get to your final output.
I assume your dataframe is called df
:
# Step 1
ids = [c for c in df.columns if c not in ['Startdate','Enddate']]
new = df.melt(id_vars=ids,value_name = 'StartEndDate').drop('variable',axis=1)
new.loc[new.StartEndDate.isin(df['Startdate'].tolist()),'Start Time'] = "00:00"
print(new)
Class Duration StudentID Age Start Time End Time TimeDifference \
0 5th XX 20002 5 00:00 20:00 3000
1 5th XX 20002 5 17:00 20:00 3000
StartEndDate
0 04/12/2021
1 04/14/2021
# Step 2
new['StartEndDate'] = pd.to_datetime(new['StartEndDate']).dt.date
new.set_index(pd.DatetimeIndex(new.StartEndDate),inplace=True)
# Step 3
final = new.reindex(pd.date_range(new.index.min(),new.index.max()), method='ffill').reset_index()\
.rename({'index':'Startdate'},axis=1).drop('StartEndDate',axis=1)
final['Enddate'] = final['Startdate']
final['TimeDifference'] = (final['End Time'].str[:2].astype(int) - final['Start Time'].str[:2].astype(int))*60
Prints:
final = final[['Class','Duration','StudentID','Age','Startdate','Start Time','Enddate','End Time','TimeDifference']]
Class Duration StudentID Age Startdate Start Time Enddate End Time \
0 5th XX 20002 5 2021-04-12 00:00 2021-04-12 20:00
1 5th XX 20002 5 2021-04-13 00:00 2021-04-13 20:00
2 5th XX 20002 5 2021-04-14 17:00 2021-04-14 20:00
TimeDifference
0 1200
1 1200
2 180
I think some information is missing from your question, so I would suggest running line by line and do the necessary adjustments to suit your task.