standard evaluation in dplyr: summarise a variable given as a character string
Solution 1:
Please note that this answer does not apply to dplyr >= 0.7.0, only to earlier versions.
dplyr 0.7.0 has a new approach to non-standard evaluation (NSE) called tidyeval. It is described in detail in vignette("programming").
The dplyr vignette on non-standard evaluation is helpful here. Check the section "Mixing constants and variables" and you will find that the function interp from package lazyeval can be used, and "[u]se as.name if you have a character string that gives a variable name":
library(lazyeval)
df %>%
select(-matches(drp)) %>%
group_by_(key) %>%
summarise_(sum_val = interp(~sum(var, na.rm = TRUE), var = as.name(val)))
# v3 sum_val
# 1 A 21
# 2 B 19
Solution 2:
With the release of the rlang package and the 0.7.0 update to dplyr, this is now fairly simple.
When you want to use a character string (e.g. "v1") as a variable name, you just:
- Convert the string to a symbol using sym() from the rlang package
- In your function call, write !! in front of the symbol
For instance, you'd do the following:
my_var <- "Sepal.Length"
my_sym <- sym(my_var)
summarize(iris, Mean = mean(!!my_sym))
More compactly, you can convert the string to a symbol with sym() and prefix it with !! in a single step when writing your function call.
For instance, you could write:
my_var <- "Sepal.Length"
summarize(iris, mean(!!sym(my_var)))
To return to your original example, you could do the following:
library(dplyr)
library(rlang)
key <- "v3"
val <- "v2"
drp <- "v1"
# tibble() replaces the deprecated data_frame()
df <- tibble(v1 = 1:5, v2 = 6:10, v3 = c(rep("A", 3), rep("B", 2)))
df %>%
# NOTE: we don't have to do anything to `drp`
# since the matches() function expects a character string
select(-matches(drp)) %>%
group_by(!!sym(key)) %>%
summarise(sum(!!sym(val), na.rm = TRUE))
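The same pattern can be wrapped in a small function that takes the column names as strings. This is only a minimal sketch: the helper name sum_by and the output column name total are hypothetical and not part of the original answer.

```r
library(dplyr)
library(rlang)

# Hypothetical helper: `key` and `val` are column names given as strings.
sum_by <- function(data, key, val) {
  data %>%
    group_by(!!sym(key)) %>%                           # string -> symbol -> unquote
    summarise(total = sum(!!sym(val), na.rm = TRUE))   # named output column
}

df <- tibble(v1 = 1:5, v2 = 6:10, v3 = c(rep("A", 3), rep("B", 2)))
res <- sum_by(df, key = "v3", val = "v2")
res
```

Naming the summary column explicitly avoids the long default name derived from the expression.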
Alternative Syntax
With the release of rlang version 0.4.0, you can use the following syntax:
my_var <- "Sepal.Length"
my_sym <- sym(my_var)
summarize(iris, Mean = mean({{ my_sym }}))
Instead of writing !!my_sym, you can write {{ my_sym }}. This is arguably clearer, but has the disadvantage that you have to convert the string to a symbol before placing it inside the braces. For instance, you can write !!sym(my_var) but you can't write {{sym(my_var)}}.
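The {{ }} syntax is probably most at home inside a function whose argument is an unquoted column name rather than a string. A minimal sketch, assuming a hypothetical helper called mean_of (not from the original answer):

```r
library(dplyr)

# Hypothetical helper: `var` is passed as a bare (unquoted) column name.
mean_of <- function(data, var) {
  summarise(data, Mean = mean({{ var }}))  # {{ }} forwards the caller's column
}

res <- mean_of(iris, Sepal.Length)
res
```

Here the caller writes the column name directly, so no sym() conversion is needed.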
Additional details
Of all the official documentation explaining how sym() and !! work, these seem to be the most accessible:
- dplyr vignette: Programming with dplyr
- The section of Hadley Wickham's book 'Advanced R' on metaprogramming
Solution 3:
Pass the .dots argument a list of strings, constructing the strings with paste, sprintf, or string interpolation from package gsubfn via fn$list in place of list, as we do here:
library(gsubfn)
df %>%
group_by_(key) %>%
summarise_(.dots = fn$list(mean = "mean($val)", sd = "sd($val)"))
giving:
Source: local data frame [2 x 3]
v3 mean sd
1 A 7.0 1.0000000
2 B 9.5 0.7071068
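Since summarise_() and .dots are deprecated in current dplyr, a similar string-building idea can be sketched with rlang::parse_expr(), which turns a string into an expression that is then injected with !!. Note that this swaps in a different technique from the gsubfn approach above; the variable names mean_expr and sd_expr are illustrative only.

```r
library(dplyr)

key <- "v3"
val <- "v2"
df <- tibble(v1 = 1:5, v2 = 6:10, v3 = c(rep("A", 3), rep("B", 2)))

# Build each expression as a string with sprintf(), then parse it into a call.
mean_expr <- rlang::parse_expr(sprintf("mean(%s)", val))
sd_expr   <- rlang::parse_expr(sprintf("sd(%s)", val))

res <- df %>%
  group_by(.data[[key]]) %>%
  summarise(mean = !!mean_expr, sd = !!sd_expr)  # inject the parsed calls
res
```

Parsing strings into code is best limited to trusted input, since any string that parses will be evaluated.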
Solution 4:
New dplyr update:
The new functionality of dplyr can help with this. Instead of strings for the variables that need non-standard evaluation, we use quosures, created with quo(). We unquote them with the !! operator. For more on these see this vignette. You will need the development version of dplyr until the full release.
library(dplyr) #0.5.0.9004+
key <- quo(v3)
val <- quo(v2)
drp <- "v1"
df <- data_frame(v1 = 1:5, v2 = 6:10, v3 = c(rep("A", 3), rep("B", 2)))
df %>% select(-matches("v1")) %>%
group_by(!!key) %>%
summarise(sum(!!val, na.rm = TRUE))
# # A tibble: 2 × 2
# v3 `sum(v2, na.rm = TRUE)`
# <chr> <int>
# 1 A 21
# 2 B 19
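Inside a function, the counterpart of quo() is enquo(), which captures the expression the caller supplied for an argument. A minimal sketch, assuming a hypothetical helper sum_by_quo and an illustrative output column total:

```r
library(dplyr)
library(rlang)

# Hypothetical helper: `key` and `val` are bare column names.
sum_by_quo <- function(data, key, val) {
  key <- enquo(key)  # capture the caller's expression as a quosure
  val <- enquo(val)
  data %>%
    group_by(!!key) %>%
    summarise(total = sum(!!val, na.rm = TRUE))
}

df <- tibble(v1 = 1:5, v2 = 6:10, v3 = c(rep("A", 3), rep("B", 2)))
res <- sum_by_quo(df, v3, v2)
res
```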
Solution 5:
dplyr 1.0 has changed pretty much everything about this question as well as all of the answers. See the dplyr programming vignette here:
https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html
The new way to refer to columns when their identifier is stored as a character vector is to use the .data pronoun from rlang, and then subset as you would in base R.
library(dplyr)
key <- "v3"
val <- "v2"
drp <- "v1"
df <- tibble(v1 = 1:5, v2 = 6:10, v3 = c(rep("A", 3), rep("B", 2)))
df %>%
select(-matches(drp)) %>%
group_by(.data[[key]]) %>%
summarise(total = sum(.data[[val]], na.rm = TRUE))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 2 x 2
#> v3 total
#> <chr> <int>
#> 1 A 21
#> 2 B 19
If your code is in a package function, you can @importFrom rlang .data to avoid R CMD check notes about undefined globals.
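As a rough sketch of what such a package function might look like, the roxygen tag sits above the function definition. The function name sum_by_name and the column name total are hypothetical; dplyr functions are called with the dplyr:: prefix, as is common in package code.

```r
#' Sum one column within groups, with both columns named by strings
#'
#' @param data A data frame.
#' @param key,val Column names as character strings.
#' @importFrom rlang .data
sum_by_name <- function(data, key, val) {
  grouped <- dplyr::group_by(data, .data[[key]])
  dplyr::summarise(grouped, total = sum(.data[[val]], na.rm = TRUE))
}

# Usage outside a package (the roxygen comments are simply ignored here)
df <- dplyr::tibble(v1 = 1:5, v2 = 6:10, v3 = c(rep("A", 3), rep("B", 2)))
res <- sum_by_name(df, key = "v3", val = "v2")
res
```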