dplyr filter with condition on multiple columns
I'd like to remove rows corresponding to a particular combination of variables from my data frame.
Here's a dummy data :
father<- c(1, 1, 1, 1, 1)
mother<- c(1, 1, 1, NA, NA)
children <- c(NA, NA, 2, 5, 2)
cousins <- c(NA, 5, 1, 1, 4)
dataset <- data.frame(father, mother, children, cousins)
dataset
father mother children cousins
1 1 NA NA
1 1 NA 5
1 1 2 1
1 NA 5 1
1 NA 2 4
I want to filter this row :
father mother children cousins
1 1 NA NA
I can do it with :
test <- dataset %>%
filter(father==1 & mother==1) %>%
filter (is.na(children)) %>%
filter (is.na(cousins))
test
My question : I have many columns like grand father, uncle1, uncle2, uncle3 and I want to avoid something like that:
filter (is.na(children)) %>%
filter (is.na(cousins)) %>%
filter (is.na(uncle1)) %>%
filter (is.na(uncle2)) %>%
filter (is.na(uncle3))
and so on...
How can I use dplyr to say filter all the column with na (except father==1 & mother==1)
Solution 1:
A possible dplyr
(0.5.0.9004 <= version < 1.0) solution is:
# > packageVersion('dplyr')
# [1] ‘0.5.0.9004’
dataset %>%
filter(!is.na(father), !is.na(mother)) %>%
filter_at(vars(-father, -mother), all_vars(is.na(.)))
Explanation:
-
vars(-father, -mother)
: select all columns exceptfather
andmother
. -
all_vars(is.na(.))
: keep rows whereis.na
isTRUE
for all the selected columns.
note: any_vars
should be used instead of all_vars
if rows where is.na
is TRUE
for any column are to be kept.
Update (2020-11-28)
As the _at
functions and vars
have been superseded by the use of across
since dplyr 1.0, the following way (or similar) is recommended now:
dataset %>%
filter(across(c(father, mother), ~ !is.na(.x))) %>%
filter(across(c(-father, -mother), is.na))
See more example of across
and how to rewrite previous code with the new approach here: Colomn-wise operatons or type vignette("colwise")
in R after installing the latest version of dplyr
.
Solution 2:
dplyr >= 1.0.4
If you're using dplyr version >= 1.0.4 you really should use if_any
or if_all
, which specifically combines the results of the predicate function into a single logical vector making it very useful in filter
. The syntax is identical to across
, but these verbs were added to help fill this need: if_any/if_all.
library(dplyr)
dataset %>%
filter(if_all(-c(father, mother), ~ is.na(.)), if_all(c(father, mother), ~ !is.na(.)))
Output
father mother children cousins
1 1 1 NA NA
Solution 3:
None of the answers seems to be an adaptable solution. I think the intention is not to list all the variables and values to filter the data.
One easy way to achieve this is through merging. If you have all the conditions in df_filter then you can do this:
df_results = df_filter %>% left_join(df_all)