Detect at least one match between each data frame row and values in vector
My dataframe looks like this:
x1 <- c("a", "c", "f", "j")
x2 <- c("b", "c", "g", "k")
x3 <- c("b", "d", "h", NA)
x4 <- c("a", "e", "i", NA)
df <- data.frame(x1, x2, x3, x4, stringsAsFactors=F)
df
x1 x2 x3 x4
1 a b b a
2 c c d e
3 f g h i
4 j k <NA> <NA>
Now I have an arbitrary vector:
vec <- c("a", "i", "s", "t", "z")
I would like to compare the vector values with each row in the data frame and create an additional column that indicates whether at least one (ANY) of the vector values was found or not.
The resulting dataframe should look like this:
x1 x2 x3 x4 valueFound
1 a b b a 1
2 c c d e 0
3 f g h i 1
4 j k <NA> <NA> 0
I would like to do it without looping. Thank you very much for your support!
Rami
This would be faster than an apply
based solution (despite it's cryptic construction):
as.numeric(rowSums(`dim<-`(as.matrix(df) %in% vec, dim(df))) >= 1)
[1] 1 0 1 0
Update -- Some benchmarks
Here, we can make up some bigger data to test on.... These benchmarks are on 100k rows.
set.seed(1)
nrow <- 100000
ncol <- 10
vec <- c("a", "i", "s", "t", "z")
df <- data.frame(matrix(sample(c(letters, NA), nrow * ncol, TRUE),
nrow = nrow, ncol = ncol), stringsAsFactors = FALSE)
Here are the approaches we have so far:
AM <- function() as.numeric(rowSums(`dim<-`(as.matrix(df) %in% vec, dim(df))) >= 1)
NR1 <- function() {
apply(df,1,function(x){
if(any(x %in% vec)){
1
} else {
0
}
})
}
NR2 <- function() apply(df, 1, function(x) any(x %in% vec) + 0)
NR3 <- function() apply(df, 1, function(x) as.numeric(any(x %in% vec)))
NR4 <- function() apply(df, 1, function(x) any(x %in% vec) %/% TRUE)
NR5 <- function() apply(df, 1, function(x) cumprod(any(x %in% vec)))
RS1 <- function() as.numeric(grepl(paste(vec, collapse="|"), do.call(paste, df)))
RS2 <- function() as.numeric(seq(nrow(df)) %in% row(df)[unlist(df) %in% vec])
I'm suspecting the NR functions will be a little slower:
system.time(NR1()) # Other NR functions are about the same
# user system elapsed
# 1.172 0.000 1.196
And, similarly, Richard's second approach:
system.time(RS2())
# user system elapsed
# 0.918 0.000 0.932
The grepl
and this rowSum
function are left for the benchmarks:
library(microbenchmark)
microbenchmark(AM(), RS1())
# Unit: milliseconds
# expr min lq mean median uq max neval
# AM() 65.75296 67.2527 92.03043 84.58111 102.3199 234.6114 100
# RS1() 253.57360 256.6148 266.89640 260.18038 264.1531 385.6525 100
Here's one way to do this:
df$valueFound <- apply(df,1,function(x){
if(any(x %in% vec)){
1
} else {
0
}
})
##
> df
x1 x2 x3 x4 valueFound
1 a b b a 1
2 c c d e 0
3 f g h i 1
4 j k <NA> <NA> 0
Thanks to @David Arenburg and @CathG, a couple of more concise approaches:
apply(df, 1, function(x) any(x %in% vec) + 0)
apply(df, 1, function(x) as.numeric(any(x %in% vec)))
Just for fun, a couple of other interesting variants:
apply(df, 1, function(x) any(x %in% vec) %/% TRUE)
apply(df, 1, function(x) cumprod(any(x %in% vec)))
Since you don't want a loop, you could get creative and paste the columns together by row, and then use grepl
to compare it with vec
> as.numeric(grepl(paste(vec, collapse="|"), do.call(paste, df)))
[1] 1 0 1 0
Here's a second option that compares the rows to the unlisted data frame
> as.numeric(seq(nrow(df)) %in% row(df)[unlist(df) %in% vec])
[1] 1 0 1 0