Filter multiple values on a string column in dplyr
I have a data.frame
with character data in one of the columns.
I would like to filter multiple options in the data.frame
from the same column. Is there an easy way to do this that I'm missing?
Example:
data.frame
name = dat
days name
88 Lynn
11 Tom
2 Chris
5 Lisa
22 Kyla
1 Tom
222 Lynn
2 Lynn
I'd like to filter out Tom
and Lynn
for example.
When I do:
target <- c("Tom", "Lynn")
filt <- filter(dat, name == target)
I get this error:
longer object length is not a multiple of shorter object length
Solution 1:
You need %in%
instead of ==
:
library(dplyr)
target <- c("Tom", "Lynn")
filter(dat, name %in% target) # equivalently, dat %>% filter(name %in% target)
Produces
days name
1 88 Lynn
2 11 Tom
3 1 Tom
4 222 Lynn
5 2 Lynn
To understand why, consider what happens here:
dat$name == target
# [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
Basically, we're recycling the two length target
vector four times to match the length of dat$name
. In other words, we are doing:
Lynn == Tom
Tom == Lynn
Chris == Tom
Lisa == Lynn
... continue repeating Tom and Lynn until end of data frame
In this case we don't get an error because I suspect your data frame actually has a different number of rows that don't allow recycling, but the sample you provide does (8 rows). If the sample had had an odd number of rows I would have gotten the same error as you. But even when recycling works, this is clearly not what you want. Basically, the statement dat$name == target
is equivalent to saying:
return
TRUE
for every odd value that is equal to "Tom" or every even value that is equal to "Lynn".
It so happens that the last value in your sample data frame is even and equal to "Lynn", hence the one TRUE
above.
To contrast, dat$name %in% target
says:
for each value in
dat$name
, check that it exists intarget
.
Very different. Here is the result:
[1] TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE
Note your problem has nothing to do with dplyr
, just the mis-use of ==
.
Solution 2:
This can be achieved using dplyr package, which is available in CRAN. The simple way to achieve this:
- Install
dplyr
package. - Run the below code
library(dplyr)
df<- select(filter(dat,name=='tom'| name=='Lynn'), c('days','name))
Explanation:
So, once we’ve downloaded dplyr, we create a new data frame by using two different functions from this package:
filter: the first argument is the data frame; the second argument is the condition by which we want it subsetted. The result is the entire data frame with only the rows we wanted. select: the first argument is the data frame; the second argument is the names of the columns we want selected from it. We don’t have to use the names() function, and we don’t even have to use quotation marks. We simply list the column names as objects.
Solution 3:
Using the base
package:
df <- data.frame(days = c(88, 11, 2, 5, 22, 1, 222, 2), name = c("Lynn", "Tom", "Chris", "Lisa", "Kyla", "Tom", "Lynn", "Lynn"))
# Three lines
target <- c("Tom", "Lynn")
index <- df$name %in% target
df[index, ]
# One line
df[df$name %in% c("Tom", "Lynn"), ]
Output:
days name
1 88 Lynn
2 11 Tom
6 1 Tom
7 222 Lynn
8 2 Lynn
Using sqldf
:
library(sqldf)
# Two alternatives:
sqldf('SELECT *
FROM df
WHERE name = "Tom" OR name = "Lynn"')
sqldf('SELECT *
FROM df
WHERE name IN ("Tom", "Lynn")')