Detect at least one match between each data frame row and values in vector

My dataframe looks like this:

x1 <- c("a", "c", "f", "j")
x2 <- c("b", "c", "g", "k")
x3 <- c("b", "d", "h", NA)
x4 <- c("a", "e", "i", NA)
df <- data.frame(x1, x2, x3, x4, stringsAsFactors=F)

df

x1 x2   x3   x4
1  a  b    b    a
2  c  c    d    e
3  f  g    h    i
4  j  k <NA> <NA>

Now I have an arbitrary vector:

vec <- c("a", "i", "s", "t", "z")

I would like to compare the vector values with each row in the data frame and create an additional column that indicates whether at least one (ANY) of the vector values was found or not.

The resulting dataframe should look like this:

  x1 x2   x3   x4 valueFound
1  a  b    b    a          1
2  c  c    d    e          0
3  f  g    h    i          1
4  j  k <NA> <NA>          0

I would like to do it without looping. Thank you very much for your support!

Rami


This would be faster than an apply based solution (despite it's cryptic construction):

as.numeric(rowSums(`dim<-`(as.matrix(df) %in% vec, dim(df))) >= 1)
[1] 1 0 1 0

Update -- Some benchmarks

Here, we can make up some bigger data to test on.... These benchmarks are on 100k rows.

set.seed(1)
nrow <- 100000
ncol <- 10
vec <- c("a", "i", "s", "t", "z")
df <- data.frame(matrix(sample(c(letters, NA), nrow * ncol, TRUE),
                        nrow = nrow, ncol = ncol), stringsAsFactors = FALSE)

Here are the approaches we have so far:

AM <- function() as.numeric(rowSums(`dim<-`(as.matrix(df) %in% vec, dim(df))) >= 1)
NR1 <- function() {
  apply(df,1,function(x){
    if(any(x %in% vec)){ 
      1 
    } else {
      0
    }
  })
}
NR2 <- function() apply(df, 1, function(x) any(x %in% vec) + 0)
NR3 <- function() apply(df, 1, function(x) as.numeric(any(x %in% vec)))
NR4 <- function() apply(df, 1, function(x) any(x %in% vec) %/% TRUE)
NR5 <- function() apply(df, 1, function(x) cumprod(any(x %in% vec)))
RS1 <- function() as.numeric(grepl(paste(vec, collapse="|"), do.call(paste, df)))
RS2 <- function() as.numeric(seq(nrow(df)) %in% row(df)[unlist(df) %in% vec])

I'm suspecting the NR functions will be a little slower:

system.time(NR1()) # Other NR functions are about the same
#    user  system elapsed 
#   1.172   0.000   1.196 

And, similarly, Richard's second approach:

system.time(RS2())
#    user  system elapsed 
#   0.918   0.000   0.932 

The grepl and this rowSum function are left for the benchmarks:

library(microbenchmark)
microbenchmark(AM(), RS1())
# Unit: milliseconds
#   expr       min       lq      mean    median       uq      max neval
#   AM()  65.75296  67.2527  92.03043  84.58111 102.3199 234.6114   100
#  RS1() 253.57360 256.6148 266.89640 260.18038 264.1531 385.6525   100

Here's one way to do this:

df$valueFound <- apply(df,1,function(x){
  if(any(x %in% vec)){ 
    1 
  } else {
    0
  }
})
##
> df
  x1 x2   x3   x4 valueFound
1  a  b    b    a          1
2  c  c    d    e          0
3  f  g    h    i          1
4  j  k <NA> <NA>          0

Thanks to @David Arenburg and @CathG, a couple of more concise approaches:

  • apply(df, 1, function(x) any(x %in% vec) + 0)
  • apply(df, 1, function(x) as.numeric(any(x %in% vec)))

Just for fun, a couple of other interesting variants:

  • apply(df, 1, function(x) any(x %in% vec) %/% TRUE)
  • apply(df, 1, function(x) cumprod(any(x %in% vec)))

Since you don't want a loop, you could get creative and paste the columns together by row, and then use grepl to compare it with vec

> as.numeric(grepl(paste(vec, collapse="|"), do.call(paste, df)))
[1] 1 0 1 0 

Here's a second option that compares the rows to the unlisted data frame

> as.numeric(seq(nrow(df)) %in% row(df)[unlist(df) %in% vec])
[1] 1 0 1 0