Calculate the sum of the differences between all dates within an expanding window of dates
The output column below is what I'm trying to calculate, and the diffs column is an explanation of the differences that are summed to calculate output.
+------------+--------+-------------+
| date | output | diffs |
+------------+--------+-------------+
| 01/01/2000 | | |
| 10/01/2000 | 9 | [9] |
| 20/01/2000 | 29 | [10, 19] |
| 25/01/2000 | 44 | [5, 15, 24] |
+------------+--------+-------------+
I've thought about using rolling and then creating a new column within each window for the diffs based on the last record in the current window and then summing these. However, rolling doesn't seem to be able to fix the window at the beginning of a DataFrame. I suppose I could calculate the difference between the minimum and maximum dates and use this as the rolling period, but that seems hacky.
I've also looked at expanding, but I couldn't see a way of creating new diffs as the window expanded.
Is there a non-loop, hopefully vectorisable, solution to this?
Here's the DataFrame:
import datetime as dt

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'date': (
            dt.datetime(2000, 1, 1), dt.datetime(2000, 1, 10),
            dt.datetime(2000, 1, 20), dt.datetime(2000, 1, 25),
        ),
        # Expected result; np.nan is used because np.NaN was removed in NumPy 2.0
        'output': (np.nan, 9, 29, 44),
    }
)
Solution 1:
If you're looking for output, try:
datediff = df.date.diff()/pd.Timedelta('1D')
df['output'] = (datediff * np.arange(len(df))).cumsum()
Output:
date output
0 2000-01-01 NaN
1 2000-01-10 9.0
2 2000-01-20 29.0
3 2000-01-25 44.0
I'll leave it to you to work out the logic behind it.
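For anyone who wants to check the logic: when row k (counting from 0) is added, each of the k earlier dates ends up further from the newest date by the latest gap, so the total grows by k times that gap. That is why multiplying the day gaps by np.arange(len(df)) and taking a cumulative sum gives the result. The snippet below is only a sanity check under that reading, not part of the original answer; the names dates, gaps, vectorised and brute are made up for illustration, and the brute-force loop exists purely for comparison.

```python
import numpy as np
import pandas as pd

# Same four dates as in the question
dates = pd.Series(
    pd.to_datetime(["2000-01-01", "2000-01-10", "2000-01-20", "2000-01-25"])
)

# Vectorised form from the answer: day gaps weighted by row position, then cumsum
gaps = dates.diff() / pd.Timedelta("1D")
vectorised = (gaps * np.arange(len(dates))).cumsum()

# Brute-force comparison: for each row k, sum the day differences
# between date k and every earlier date (the first row has none, so NaN)
brute = pd.Series(
    [np.nan]
    + [
        float(sum((dates[k] - dates[i]).days for i in range(k)))
        for k in range(1, len(dates))
    ]
)

print(pd.DataFrame({"date": dates, "vectorised": vectorised, "brute": brute}))
```

Both columns should agree on every row after the first, where neither has a value.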