dplyr summarise_each with na.rm

r dplyr

Is there a way to instruct dplyr to use summarise_each with na.rm=TRUE? I would like to take the mean of variables with summarise_each("mean") but I don't know how to specify it to ignore missing values.

Following the links in the doc, it seems you can use funs(mean(., na.rm = TRUE)):

library(dplyr)
by_species <- iris %>% group_by(Species)
by_species %>% summarise_each(funs(mean(., na.rm = TRUE)))
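To see why the extra argument matters, here is a minimal base R sketch (the vector `x` is made up for illustration): a single NA makes the plain mean NA, while na.rm = TRUE drops it first.

```r
# Hypothetical example vector with one missing value
x <- c(1, NA, 3)

plain_mean <- mean(x)                # NA, because the missing value propagates
clean_mean <- mean(x, na.rm = TRUE)  # 2, the mean of 1 and 3

print(plain_mean)
print(clean_mean)
```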

update

Current dplyr versions (1.0.0 and later) recommend across() over the scoped variants such as summarise_all().

Translating the syntax from the older answer below (the functions are passed as a named list) into across() could look like this:

library(dplyr)
ggplot2::msleep %>% 
  select(vore, sleep_total, sleep_rem) %>%
  group_by(vore) %>%
  # `.fns` is the full argument name; `.f` only worked through partial matching
  summarise(across(everything(), .fns = list(mean = mean, max = max, sd = sd), na.rm = TRUE))

#> # A tibble: 5 x 7
#>   vore  sleep_total_mean sleep_total_max sleep_total_sd sleep_rem_mean
#>   <chr>            <dbl>           <dbl>          <dbl>          <dbl>
#> 1 carni            10.4             19.4           4.67           2.29
#> 2 herbi             9.51            16.6           4.88           1.37
#> 3 inse~            14.9             19.9           5.92           3.52
#> 4 omni             10.9             18             2.95           1.96
#> 5 <NA>             10.2             13.7           3.00           1.88
#> # ... with 2 more variables: sleep_rem_max <dbl>, sleep_rem_sd <dbl>
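As far as I know, newer dplyr releases (1.1.0 onward) deprecate passing extra arguments such as na.rm through across(). A sketch of the same summary with na.rm set inside each lambda, assuming ggplot2 is installed for the msleep data (the name `sleep_summary` is only for illustration):

```r
library(dplyr)

# Same grouping and columns as above, but na.rm is set inside each lambda
# instead of being forwarded through across()
sleep_summary <- ggplot2::msleep %>%
  select(vore, sleep_total, sleep_rem) %>%
  group_by(vore) %>%
  summarise(across(
    everything(),
    list(
      mean = ~ mean(.x, na.rm = TRUE),
      max  = ~ max(.x, na.rm = TRUE),
      sd   = ~ sd(.x, na.rm = TRUE)
    )
  ))

sleep_summary
```

The column names follow the same column_function pattern (sleep_total_mean, sleep_rem_sd, and so on), so this should be a drop-in replacement.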

older answer

summarise_each is now deprecated; here is an option with summarise_all.

One can still specify na.rm = TRUE within the funs() call (cf. @flodel's answer: just replace summarise_each with summarise_all).
But you can also pass na.rm = TRUE as an additional argument after funs().

That is useful when you want to apply more than one function at once, as in the code after the edit note.

edit

The funs() function is now soft-deprecated (thanks to @Mikko for the comment). One can use the alternatives suggested by the warning, shown in the output below. na.rm can still be passed as an additional argument to summarise_all.

I used ggplot2::msleep because it contains NAs and shows this better.

library(dplyr)

ggplot2::msleep %>% 
  select(vore, sleep_total, sleep_rem) %>%
  group_by(vore) %>%
  summarise_all(funs(mean, max, sd), na.rm = TRUE)
#> Warning: funs() is soft deprecated as of dplyr 0.8.0
#> Please use a list of either functions or lambdas: 
#> 
#>   # Simple named list: 
#>   list(mean = mean, median = median)
#> 
#>   # Auto named with `tibble::lst()`: 
#>   tibble::lst(mean, median)
#> 
#>   # Using lambdas
#>   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
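Following the first suggestion in the warning, a sketch of the same call with a simple named list in place of funs(), again assuming ggplot2 is installed for msleep (the name `sleep_all` is only for illustration):

```r
library(dplyr)

# Named list of functions instead of funs(); na.rm = TRUE is still forwarded
# to every function as an additional argument of summarise_all()
sleep_all <- ggplot2::msleep %>%
  select(vore, sleep_total, sleep_rem) %>%
  group_by(vore) %>%
  summarise_all(list(mean = mean, max = max, sd = sd), na.rm = TRUE)

sleep_all
```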

Take the mtcars data set, for instance. You can always use plain summarise to avoid the longer syntax:

library(dplyr)

mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE),
            sd_mpg = sd(mpg, na.rm = TRUE))
