How to build a real-time dataframe in R?
Let's say I have two dataframes like the ones below:
df1 = structure(list(Date = c("2000-01-05", "2000-02-03", "2000-03-02",
"2000-03-30", "2000-04-13", "2000-05-11", "2000-06-08", "2000-07-06",
"2000-09-14", "2000-10-19", "2000-11-02", "2000-12-14", "2001-02-01",
"2001-03-01", "2001-04-11", "2001-05-10", "2001-06-07", "2001-06-21",
"2001-07-05", "2001-08-30", "2001-10-11", "2001-11-08", "2001-12-06"
)), row.names = c(NA, 23L), class = "data.frame")
Date
1 2000-01-05
2 2000-02-03
3 2000-03-02
4 2000-03-30
5 2000-04-13
6 2000-05-11
7 2000-06-08
8 2000-07-06
9 2000-09-14
10 2000-10-19
11 2000-11-02
12 2000-12-14
13 2001-02-01
14 2001-03-01
15 2001-04-11
16 2001-05-10
17 2001-06-07
18 2001-06-21
19 2001-07-05
20 2001-08-30
21 2001-10-11
22 2001-11-08
23 2001-12-06
df2 = structure(list(Date = structure(c(10987, 11016, 11047, 11077,
11108, 11138, 11169, 11200, 11230, 11261, 11291, 11322, 11353,
11381, 11412, 11442, 11473, 11503, 11534, 11565, 11595, 11626,
11656, 11687), class = "Date"), x = c(3.04285714285714, 3.27571428571429,
3.5104347826087, 3.685, 3.92, 4.29454545454545, 4.30857142857143,
4.41913043478261, 4.59047619047619, 4.76272727272727, 4.82909090909091,
4.82684210526316, 4.75590909090909, 4.9925, 4.78136363636364,
5.06421052631579, 4.65363636363636, 4.53952380952381, 4.50545454545454,
4.49130434782609, 3.9865, 3.97130434782609, 3.50727272727273,
3.33888888888889)), row.names = c(NA, 24L), class = "data.frame")
Date x
1 2000-01-31 3.042857
2 2000-02-29 3.275714
3 2000-03-31 3.510435
4 2000-04-30 3.685000
5 2000-05-31 3.920000
6 2000-06-30 4.294545
7 2000-07-31 4.308571
8 2000-08-31 4.419130
9 2000-09-30 4.590476
10 2000-10-31 4.762727
11 2000-11-30 4.829091
12 2000-12-31 4.826842
13 2001-01-31 4.755909
14 2001-02-28 4.992500
15 2001-03-31 4.781364
16 2001-04-30 5.064211
17 2001-05-31 4.653636
18 2001-06-30 4.539524
19 2001-07-31 4.505455
20 2001-08-31 4.491304
21 2001-09-30 3.986500
22 2001-10-31 3.971304
23 2001-11-30 3.507273
24 2001-12-31 3.338889
Now, what I would like to do is to create a real-time dataframe, that is, the data in df2
that were only available at the time of df1
. For instance, at 2000-01-05
(first row in df1
) no data in df2
was available since since 2000-01-31
(first row of df2
) occurs after 2000-01-05
. However, in 2000-02-03
(second row in df1
) the observation in 2000-01-31
(first row of df2
) is available. This should be the reasoning for every row. The outcome should look like this:
Date y
1 2000-01-05 NA
2 2000-02-03 3.042857
3 2000-03-02 3.275714
4 2000-03-30 3.275714
5 2000-04-13 3.510435
6 2000-05-11 3.685000
....
The rule would be: pick up from df2
only the observation that was available at the time of df1
.
Can anyone help me?
Thanks!
What you can do is complete the df2
dates and then join.
library(dplyr)
library(tidyr)
# create a dataframe with all the days, not just the snapshots
df2_complete <- df2 %>%
complete(Date = seq.Date(min(Date), max(Date), by = "day")) %>%
fill(x, .direction = "down")
# convert to Date class for this case and join
df1 %>%
mutate(Date = as.Date(Date)) %>%
left_join(df2_complete, by = "Date")
Which gives:
Date x
1 2000-01-05 NA
2 2000-02-03 3.042857
3 2000-03-02 3.275714
4 2000-03-30 3.275714
5 2000-04-13 3.510435
6 2000-05-11 3.685000
....