Find which interval row in a data frame each element of a vector belongs to
I have a vector of numeric elements, and a dataframe with two columns that define the start and end points of intervals. Each row in the dataframe is one interval. I want to find out which interval each element in the vector belongs to.
Here's some example data:
# Find which interval each element of the vector belongs to
library(tidyverse)
elements <- c(0.1, 0.2, 0.5, 0.9, 1.1, 1.9, 2.1)
intervals <- tribble(~phase, ~start, ~end,
"a", 0, 0.5,
"b", 1, 1.9,
"c", 2, 2.5)
The same example data for those who object to the tidyverse:
elements <- c(0.1, 0.2, 0.5, 0.9, 1.1, 1.9, 2.1)
intervals <- structure(list(phase = c("a", "b", "c"),
start = c(0, 1, 2),
end = c(0.5, 1.9, 2.5)),
.Names = c("phase", "start", "end"),
row.names = c(NA, -3L),
class = "data.frame")
Here's one way to do it:
library(intrval)
phases_for_elements <-
map(elements, ~.x %[]% data.frame(intervals[, c('start', 'end')])) %>%
map(., ~unlist(intervals[.x, 'phase']))
Here's the output:
[[1]]
phase
"a"
[[2]]
phase
"a"
[[3]]
phase
"a"
[[4]]
character(0)
[[5]]
phase
"b"
[[6]]
phase
"b"
[[7]]
phase
"c"
But I'm looking for a simpler method with less typing. I've seen findInterval in related questions, but I'm not sure how I can use it in this situation.
Solution 1:
Here's a possible solution using the new "non-equi" joins in data.table (v>=1.9.8). While I doubt you'll like the syntax, it should be a very efficient solution.
Also, regarding findInterval, this function assumes continuity in your intervals, which isn't the case here, so I doubt there is a straightforward solution using it.
library(data.table) #v1.10.0
setDT(intervals)[data.table(elements), on = .(start <= elements, end >= elements)]
# phase start end
# 1: a 0.1 0.1
# 2: a 0.2 0.2
# 3: a 0.5 0.5
# 4: NA 0.9 0.9
# 5: b 1.1 1.1
# 6: b 1.9 1.9
# 7: c 2.1 2.1
Regarding the above code, I find it pretty self-explanatory: join intervals and elements by the condition specified in the on argument. That's pretty much it.
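To make the join condition concrete, here is a rough base R sketch of the same closed-interval test, assuming the example data from the question. It is not how data.table implements the join, only an illustration of the condition; hits and idx are names made up for this sketch.

```r
elements <- c(0.1, 0.2, 0.5, 0.9, 1.1, 1.9, 2.1)
intervals <- data.frame(phase = c("a", "b", "c"),
                        start = c(0, 1, 2),
                        end   = c(0.5, 1.9, 2.5),
                        stringsAsFactors = FALSE)

# hits[i, j] is TRUE when start[j] <= elements[i] <= end[j]
hits <- outer(elements, intervals$start, `>=`) &
  outer(elements, intervals$end, `<=`)

# Row index of the first matching interval per element (NA when none matches)
idx <- apply(hits, 1, function(h) which(h)[1])

intervals$phase[idx]
```

This builds a length(elements) by nrow(intervals) matrix, so it is only meant for small inputs; the non-equi join avoids materialising that matrix.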
There is one caveat here, though: start, end and elements should all be of the same type, so if one of them is integer, it should be converted to numeric first.
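A minimal sketch of that conversion in base R, assuming a hypothetical intervals_int whose bounds happen to be stored as integers (the name and values are made up for illustration):

```r
# Hypothetical data frame with integer interval bounds
intervals_int <- data.frame(phase = c("a", "b", "c"),
                            start = c(0L, 2L, 4L),
                            end   = c(1L, 3L, 5L),
                            stringsAsFactors = FALSE)

# Convert only the bound columns to double so they match a numeric elements vector
bound_cols <- c("start", "end")
intervals_int[bound_cols] <- lapply(intervals_int[bound_cols], as.numeric)

sapply(intervals_int[bound_cols], typeof)
```

The same idea applies in the other direction: if elements is integer and the bounds are double, as.numeric(elements) brings them in line.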
Solution 2:
cut is possibly useful here.
out <- cut(elements, t(intervals[c("start","end")]))
levels(out)[c(FALSE,TRUE)] <- NA
intervals$phase[out]
#[1] "a" "a" "a" NA "b" "b" "c"
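To see why dropping every second level works, it may help to look at the intermediate values. This sketch assumes the example data from the question; breaks is just a name used here for the flattened start/end vector.

```r
elements <- c(0.1, 0.2, 0.5, 0.9, 1.1, 1.9, 2.1)
intervals <- data.frame(phase = c("a", "b", "c"),
                        start = c(0, 1, 2),
                        end   = c(0.5, 1.9, 2.5),
                        stringsAsFactors = FALSE)

# Transposing interleaves the bounds: start1, end1, start2, end2, ...
breaks <- as.vector(t(intervals[c("start", "end")]))
breaks

out <- cut(elements, breaks)

# Odd-numbered levels are the real intervals, even-numbered levels are the gaps
levels(out)

# 0.9 lands in the gap between interval a and interval b
as.character(out[4])
```

Setting the even-numbered levels to NA removes the gaps, so the remaining level codes 1, 2, 3 line up with the rows of intervals. Note that this relies on the intervals being sorted and non-overlapping.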
Solution 3:
David Arenburg's mention of non-equi joins was very helpful for understanding what general kind of problem this is (thanks!). I can see now that it's not implemented for dplyr. Thanks to this answer, I see that there is a fuzzyjoin package that can do it in the same idiom. But it's barely any simpler than my map solution above (though more readable, in my view), and doesn't hold a candle to thelatemail's cut answer for brevity.
For my example above, the fuzzyjoin solution would be
library(fuzzyjoin)
library(tidyverse)
fuzzy_left_join(data.frame(elements), intervals,
by = c("elements" = "start", "elements" = "end"),
match_fun = list(`>=`, `<=`)) %>%
distinct()
Which gives:
elements phase start end
1 0.1 a 0 0.5
2 0.2 a 0 0.5
3 0.5 a 0 0.5
4 0.9 <NA> NA NA
5 1.1 b 1 1.9
6 1.9 b 1 1.9
7 2.1 c 2 2.5
Solution 4:
Inspired by @thelatemail's cut solution, here is one using findInterval, which still requires a lot of typing:
out <- findInterval(elements, t(intervals[c("start","end")]), left.open = TRUE)
out[!(out %% 2)] <- NA
intervals$phase[out %/% 2L + 1L]
#[1] "a" "a" "a" NA "b" "b" "c"
Caveat: cut (by default) and findInterval (with left.open = TRUE, as used here) work with left-open intervals. Therefore, the solutions using cut and findInterval are not equivalent to Ben's using intrval, David's non-equi join using data.table, and my other solution using foverlaps.
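The difference only shows up for elements that sit exactly on a start point. A small sketch, assuming the example intervals from the question and a made-up vector x of boundary values:

```r
intervals <- data.frame(phase = c("a", "b", "c"),
                        start = c(0, 1, 2),
                        end   = c(0.5, 1.9, 2.5),
                        stringsAsFactors = FALSE)

# Elements that coincide with the start of each interval
x <- c(0, 1, 2)

# Closed intervals [start, end]: every start point matches its own interval
closed <- sapply(x, function(v) {
  m <- intervals$phase[v >= intervals$start & v <= intervals$end]
  if (length(m)) m[1] else NA_character_
})
closed

# Left-open intervals (start, end] via cut: the start points fall outside
out <- cut(x, as.vector(t(intervals[c("start", "end")])))
levels(out)[c(FALSE, TRUE)] <- NA
left_open <- intervals$phase[out]
left_open
```

If the start points should count as inside, the closed-interval approaches are the safer choice.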
Solution 5:
Just lapply works:
l <- lapply(elements, function(x){
intervals$phase[x >= intervals$start & x <= intervals$end]
})
str(l)
## List of 7
## $ : chr "a"
## $ : chr "a"
## $ : chr "a"
## $ : chr(0)
## $ : chr "b"
## $ : chr "b"
## $ : chr "c"
or in purrr, if you purrrfurrr,
elements %>%
map(~intervals$phase[.x >= intervals$start & .x <= intervals$end]) %>%
# Clean up a bit. Shorter, but less readable: map_chr(~.x[1] %||% NA)
map_chr(~ifelse(length(.x) == 0, NA_character_, .x))
## [1] "a" "a" "a" NA "b" "b" "c"
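For a character vector without purrr, one option is vapply. This is a sketch assuming the example data from the question; find_phase is a hypothetical helper name, and it keeps only the first match if intervals overlap.

```r
elements <- c(0.1, 0.2, 0.5, 0.9, 1.1, 1.9, 2.1)
intervals <- data.frame(phase = c("a", "b", "c"),
                        start = c(0, 1, 2),
                        end   = c(0.5, 1.9, 2.5),
                        stringsAsFactors = FALSE)

# Hypothetical helper: phase of the first closed interval containing each x
find_phase <- function(x, intervals) {
  vapply(x, function(v) {
    # Indexing an empty match with [1] yields NA_character_
    intervals$phase[v >= intervals$start & v <= intervals$end][1]
  }, character(1))
}

phases <- find_phase(elements, intervals)
phases
```

vapply checks that each result is a single string, so a non-character phase column would fail loudly rather than return a list.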