standard evaluation in dplyr: summarise a variable given as a character string

Solution 1:

Note: this answer applies to dplyr versions before 0.7.0, not to dplyr >= 0.7.0.

dplyr 0.7.0 introduced a new approach to non-standard evaluation (NSE) called tidy evaluation (tidyeval). It is described in detail in vignette("programming").


The dplyr vignette on non-standard evaluation is helpful here. See the section "Mixing constants and variables": it shows that the function interp() from the lazyeval package can be used, and advises you to "[u]se as.name if you have a character string that gives a variable name":

library(dplyr)
library(lazyeval)
# df, key, val and drp are the objects from the question
df %>%
  select(-matches(drp)) %>%
  group_by_(key) %>%
  summarise_(sum_val = interp(~sum(var, na.rm = TRUE), var = as.name(val)))
#   v3 sum_val
# 1  A      21
# 2  B      19

Solution 2:

With the release of the rlang package and the 0.7.0 update to dplyr, this is now fairly simple.

When you want to use a character string (e.g. "v1") as a variable name, you just:

  1. Convert the string to a symbol using sym() from the rlang package
  2. In your function call, write !! in front of the symbol

For instance, you'd do the following:

library(dplyr)
library(rlang)

my_var <- "Sepal.Length"
my_sym <- sym(my_var)
summarize(iris, Mean = mean(!!my_sym))

More compactly, you can convert the string with sym() and unquote it with !! in a single expression:

my_var <- "Sepal.Length"
summarize(iris, mean(!!sym(my_var)))


To return to your original example, you could do the following:

library(dplyr)
library(rlang)

key <- "v3"
val <- "v2"
drp <- "v1"

# tibble() supersedes data_frame(), which is deprecated in current tibble releases
df <- tibble(v1 = 1:5, v2 = 6:10, v3 = c(rep("A", 3), rep("B", 2)))

df %>% 
  # NOTE: we don't have to do anything to `drp`
  # since the matches() function expects a character string
  select(-matches(drp)) %>% 
  group_by(!!sym(key)) %>% 
  summarise(sum(!!sym(val), na.rm = TRUE))
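
The same idea extends to several column names held in one character vector. The following is a minimal sketch, assuming rlang's syms() (the plural of sym()) together with the !!! splice operator; the mtcars dataset and the names keys, val and result are chosen here purely for illustration and are not part of the original question.

```r
library(dplyr)
library(rlang)

# Hypothetical inputs: two grouping columns and one value column, all as strings
keys <- c("cyl", "gear")
val  <- "mpg"

result <- mtcars %>%
  # syms() turns each string into a symbol; !!! splices the list into group_by()
  group_by(!!!syms(keys)) %>%
  summarise(mean_val = mean(!!sym(val), na.rm = TRUE))

result
```

Here !!! plays the role for a list of symbols that !! plays for a single symbol.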


Alternative Syntax

With the release of rlang version 0.4.0, you can use the following syntax:

my_var <- "Sepal.Length"
my_sym <- sym(my_var)
summarize(iris, Mean = mean({{ my_sym }}))

Instead of writing !!my_sym, you can write {{ my_sym }}. This is arguably clearer, but the string must be converted to a symbol before it is placed inside the braces: you can write !!sym(my_var), but you can't write {{ sym(my_var) }}. Keep in mind that {{ }} was designed mainly for forwarding function arguments.
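
To illustrate where each operator fits most naturally, here is a small sketch with two helper functions. Both helpers (mean_of and mean_of_chr) are hypothetical names invented for this example: the first takes an unquoted column and forwards it with {{ }}, the second takes a string and converts it with sym() and !!.

```r
library(dplyr)
library(rlang)

# Hypothetical helper: `col` is an unquoted column name, forwarded with {{ }}
mean_of <- function(data, col) {
  summarize(data, Mean = mean({{ col }}))
}

# Hypothetical helper: `col_name` is a string, converted with sym() and unquoted with !!
mean_of_chr <- function(data, col_name) {
  summarize(data, Mean = mean(!!sym(col_name)))
}

res_bare <- mean_of(iris, Sepal.Length)
res_chr  <- mean_of_chr(iris, "Sepal.Length")
```

Both calls should return the same one-row summary; which helper to write depends on whether callers will pass column names bare or as strings.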

Additional details

Of the official documentation explaining how sym() and !! work, these seem to be the most accessible:

  1. dplyr vignette: Programming with dplyr

  2. The section of Hadley Wickham's book 'Advanced R' on metaprogramming

Solution 3:

Pass the .dots argument a list of strings. The strings can be constructed with paste, sprintf, or string interpolation from the gsubfn package, using fn$list in place of list as we do here:

library(gsubfn)
df %>% 
   group_by_(key) %>% 
   summarise_(.dots = fn$list(mean = "mean($val)", sd = "sd($val)"))

giving:

Source: local data frame [2 x 3]

  v3 mean        sd
1  A  7.0 1.0000000
2  B  9.5 0.7071068
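
For readers who prefer not to depend on gsubfn, the same strings can be built in base R. This is a minimal sketch of the sprintf and paste variants mentioned above; the names dots and dots_paste are illustrative only.

```r
val <- "v2"  # column name held as a string, as in the question

# Build one expression string per summary with sprintf()
dots <- list(
  mean = sprintf("mean(%s)", val),
  sd   = sprintf("sd(%s)", val)
)

# paste0() produces the same strings
dots_paste <- list(
  mean = paste0("mean(", val, ")"),
  sd   = paste0("sd(", val, ")")
)

identical(dots, dots_paste)
```

Either list can then be supplied as summarise_(.dots = dots). Note that the underscore verbs have been deprecated since dplyr 0.7.0, so this route is mainly relevant for older code.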

Solution 4:

New dplyr update:

The new functionality of dplyr can help with this. Instead of passing strings for the variables that need non-standard evaluation, we create quosures with quo() and then unquote them with the !! operator. For more on these see this vignette. You will need the development version of dplyr until the full release.

library(dplyr) #0.5.0.9004+
key <- quo(v3)
val <- quo(v2)
drp <- "v1"

df <- data_frame(v1 = 1:5, v2 = 6:10, v3 = c(rep("A", 3), rep("B", 2)))
df %>% select(-matches(drp)) %>% 
  group_by(!!key) %>% 
  summarise(sum(!!val, na.rm = TRUE))
# # A tibble: 2 × 2
#      v3 `sum(v2, na.rm = TRUE)`
#   <chr>                   <int>
# 1     A                      21
# 2     B                      19

Solution 5:

dplyr 1.0 has changed pretty much everything about this question as well as all of the answers. See the dplyr programming vignette here:

https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html

The new way to refer to a column whose name is stored in a character vector is to use the .data pronoun from rlang and subset it with [[ ]], as you would a list in base R.

library(dplyr)

key <- "v3"
val <- "v2"
drp <- "v1"

df <- tibble(v1 = 1:5, v2 = 6:10, v3 = c(rep("A", 3), rep("B", 2)))

df %>% 
    select(-matches(drp)) %>% 
    group_by(.data[[key]]) %>% 
    summarise(total = sum(.data[[val]], na.rm = TRUE))

#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 2 x 2
#>   v3    total
#>   <chr> <int>
#> 1 A        21
#> 2 B        19
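
Since .data[[ ]] takes one string at a time, several column names at once need a different tool. One option, sketched here under the assumption of dplyr >= 1.0, is across() combined with the tidyselect helper all_of(); the names keys, vals and totals are illustrative only.

```r
library(dplyr)

df <- tibble(v1 = 1:5, v2 = 6:10, v3 = c(rep("A", 3), rep("B", 2)))

keys <- "v3"            # one or more grouping columns, as strings
vals <- c("v1", "v2")   # several columns to summarise, as strings

totals <- df %>%
  # all_of() selects exactly the columns named in the character vector
  group_by(across(all_of(keys))) %>%
  summarise(across(all_of(vals), ~ sum(.x, na.rm = TRUE)), .groups = "drop")

totals
```

The summary columns keep their original names (v1 and v2) because across() is called without a .names argument.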

If your code is in a package function, you can add the roxygen2 tag @importFrom rlang .data to avoid R CMD check notes about undefined global variables.
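
As a rough sketch of how that might look in a package, the function below wraps the pipeline from this answer. The function name sum_by and its arguments are hypothetical, and dplyr functions are called with the dplyr:: prefix as is common in package code.

```r
#' Sum one column within the groups of another (hypothetical helper)
#'
#' @param data A data frame.
#' @param key,val Column names, each given as a single string.
#' @importFrom rlang .data
#' @export
sum_by <- function(data, key, val) {
  # .data[[key]] looks the column up by its string name inside the data mask
  grouped <- dplyr::group_by(data, .data[[key]])
  dplyr::summarise(grouped, total = sum(.data[[val]], na.rm = TRUE))
}

example <- sum_by(
  data.frame(v2 = 6:10, v3 = c("A", "A", "A", "B", "B")),
  key = "v3",
  val = "v2"
)
```

The @importFrom tag only matters for R CMD check; when the function is run interactively, the .data pronoun is supplied by dplyr's data masking.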