How to filter rows based on difference in dates between rows in R?
Within each id, I would like to keep rows that are at least 91 days apart. In my dataframe df below, id=1 has 5 rows and id=2 has 1 row.

For id=1, I would like to keep only the 1st, 3rd and 5th rows. This is because the 1st and 2nd dates differ by only 32 days, so we remove the 2nd date. We proceed to comparing the 1st and 3rd dates, which differ by 152 days, so we keep the 3rd date.

Now, instead of using the 1st date as the reference, we use the 3rd date. The 3rd and 4th dates differ by 61 days, so we remove the 4th date. We proceed to comparing the 3rd and 5th dates, which differ by 547 days, so we keep the 5th date.

In the end, the dates we keep are the 1st, 3rd and 5th. As for id=2, there is only one row, so we keep it. The desired result is shown in dfnew.
df <- read.table(header = TRUE, text = "
id var1 date
1 A 2006-01-01
1 B 2006-02-02
1 C 2006-06-02
1 D 2006-08-02
1 E 2007-12-01
2 F 2007-04-20
",stringsAsFactors=FALSE)
dfnew <- read.table(header = TRUE, text = "
id var1 date
1 A 2006-01-01
1 C 2006-06-02
1 E 2007-12-01
2 F 2007-04-20
",stringsAsFactors=FALSE)
I can only think of starting by grouping df by id as follows:

library(dplyr)
dfnew <- df %>% group_by(id)

However, I am not sure how to continue from here. Should I proceed with the filter function or slice? If so, how?
Solution 1:
Here's an attempt using a rolling join in data.table, which I believe should be efficient:
library(data.table)
# Set minimum distance
mindist <- 91L
# Make sure it is a real Date
setDT(df)[, date := as.IDate(date)]
# Create a helper column shifted back by mindist + 1 days to roll-join to
df[, date2 := date - (mindist + 1L)]
# Perform a rolling join for each value in df$date2 that is at least 91 days from df$date
unique(df[df, on = c(id = "id", date = "date2"), roll = -Inf], by = c("id", "var1"))
# id var1 date date2 i.var1 i.date
# 1: 1 A 2005-10-01 2005-10-01 A 2006-01-01
# 2: 1 C 2006-03-02 2006-03-02 C 2006-06-02
# 3: 1 E 2007-08-31 2007-08-31 E 2007-12-01
# 4: 2 F 2007-01-18 2007-01-18 F 2007-04-20
This will give you two additional columns, but that's not a big deal IMO. Logically this makes sense, and I've tested it successfully on different scenarios, but it may need some further testing.
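If you want the output to match dfnew exactly, one way (just a sketch based on the join result printed above, which has columns id, var1, date, date2, i.var1 and i.date) is to keep only the joined columns and rename them:

```r
res <- unique(df[df, on = c(id = "id", date = "date2"), roll = -Inf],
              by = c("id", "var1"))
# Keep id plus the i.-prefixed columns, renamed back to var1/date
res[, .(id, var1 = i.var1, date = i.date)]
```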
Solution 2:
An alternative that uses slice from dplyr is to define the following recursive function:
library(dplyr)
f <- function(d, ind = 1) {
  ind.next <- first(which(difftime(d, d[ind], units = "days") > 90))
  if (is.na(ind.next))
    return(ind)
  else
    return(c(ind, f(d, ind.next)))
}
This function operates on the date column starting at ind = 1. It then finds the next index ind.next, the first index whose date is more than 90 days (i.e., at least 91 days) after the date indexed by ind. Note that if there is no such index, ind.next is NA and we just return ind. Otherwise, we recursively call f starting at ind.next and return its result concatenated with ind. The end result of this function call is the set of row indices whose dates are separated by at least 91 days.
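As a quick check, running f (as defined above, with dplyr loaded for first) on the five dates for id=1 picks out rows 1, 3 and 5:

```r
library(dplyr)  # for first()

d <- as.Date(c("2006-01-01", "2006-02-02", "2006-06-02",
               "2006-08-02", "2007-12-01"))
f(d)
# [1] 1 3 5
```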
With this function, we can do:
result <- df %>% group_by(id) %>% slice(f(as.Date(date, format="%Y-%m-%d")))
##Source: local data frame [4 x 3]
##Groups: id [2]
##
## id var1 date
## <int> <chr> <chr>
##1 1 A 2006-01-01
##2 1 C 2006-06-02
##3 1 E 2007-12-01
##4 2 F 2007-04-20
The use of this function assumes that the date column is sorted in ascending order within each id group. If not, we can just sort the dates before slicing. I'm not sure about the efficiency of this or the dangers of recursive calls in R. Hopefully, David Arenburg or others can comment on this.
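If recursion depth is a concern (each kept row adds a stack frame), the same logic can be written iteratively; here is a sketch of an equivalent loop-based helper:

```r
f_iter <- function(d) {
  keep <- 1L  # always keep the first row
  last <- 1L  # index of the most recently kept row
  for (i in seq_along(d)[-1]) {
    # keep row i only if it is more than 90 days after the last kept row
    if (difftime(d[i], d[last], units = "days") > 90) {
      keep <- c(keep, i)
      last <- i
    }
  }
  keep
}
```

It is used the same way, e.g. df %>% group_by(id) %>% slice(f_iter(date)), assuming date is already a Date column.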
As suggested by David Arenburg, it is better to convert date to a Date class once up front rather than within each group:
result <- df %>% mutate(date=as.Date(date, format="%Y-%m-%d")) %>%
group_by(id) %>% slice(f(date))
##Source: local data frame [4 x 3]
##Groups: id [2]
##
## id var1 date
## <int> <chr> <date>
##1 1 A 2006-01-01
##2 1 C 2006-06-02
##3 1 E 2007-12-01
##4 2 F 2007-04-20