How can I tell select() in dplyr that the string it is seeing is a column name in a data frame?

I tried searching but didn't find an answer to this question.

I'm trying to use the select statement in dplyr but am having problems when I try to pass it strings. My question is, how do I tell select() that the string it is seeing is a column name in the data frame?

e.g. this works fine

select(df.main.scaled, var1, var3)
select(df.main.scaled, var2, var4)

but this does not work:

select(df.main.scaled, names.gens[i,1], names.gens[i,2])

where

> names.genx <- c("var1","var2")
> names.geny <- c("var3","var4")
> names.gens <- cbind(names.genx, names.geny)
> names.gens
     names.genx names.geny
[1,] "var1"     "var3"    
[2,] "var2"     "var4"  

To be clear, all the strings in names.gens are column names in the data frame.

Thanks.


In more recent versions of dplyr, this is possible in select with one_of (itself superseded by all_of() and any_of() in later releases), as in

my_cols <- c('mpg', 'disp')
mtcars %>% select(one_of(my_cols))
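Applied to the matrix from the question, the same idea might look like the sketch below. It assumes dplyr 1.0 or later, where all_of() is the stricter successor to one_of(). The data frame is a made-up stand-in for df.main.scaled, and `selected` is just an illustrative name.

```r
library(dplyr)

# Hypothetical stand-in for the question's df.main.scaled
df.main.scaled <- data.frame(var1 = 1:3, var2 = 4:6, var3 = 7:9, var4 = 10:12)

names.genx <- c("var1", "var2")
names.geny <- c("var3", "var4")
names.gens <- cbind(names.genx, names.geny)

# Each row of the matrix is a character vector of two column names.
# unname() drops the matrix column labels so they cannot be picked up
# as new column names by all_of().
selected <- lapply(seq_len(nrow(names.gens)), function(i) {
  select(df.main.scaled, all_of(unname(names.gens[i, ])))
})

names(selected[[1]])
names(selected[[2]])
```

all_of() raises an error if any of the names is missing from the data frame; any_of() silently skips missing names instead.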

Select seems to work with the column indexes (dplyr 0.2), so just match your desired names to their index and use them to select the columns.

myCols <- c("mpg","disp")
colNums <- match(myCols,names(mtcars))
mtcars %>% select(colNums)
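One thing to watch with the match() approach is that it returns NA for any name that is not a column. A small base R sketch of checking for that before selecting (the name "not_a_column" and the variables `absent` and `found` are made up for illustration):

```r
myCols <- c("mpg", "disp", "not_a_column")
colNums <- match(myCols, names(mtcars))

# match() gives NA where a requested name is not present
absent <- myCols[is.na(colNums)]

# Keep only the positions that were actually found
found <- mtcars[, colNums[!is.na(colNums)], drop = FALSE]

absent
names(found)
```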

You can use get() to get the object named by the string in the current environment. So:

R> iris %>% select(Species, Petal.Length) %>% head(3)
  Species Petal.Length
1  setosa          1.4
2  setosa          1.4
3  setosa          1.3
R> iris %>% select('Species', 'Petal.Length') %>% head(3)
Error in abs(ind[ind < 0]) : 
  non-numeric argument to mathematical function     
R> iris %>% select(get('Species'), get('Petal.Length')) %>% head(3)
  Species Petal.Length
1  setosa          1.4
2  setosa          1.4
3  setosa          1.3
R> s <- 'Species'
R> p <- 'Petal.Length'
R> iris %>% select(get(s), get(p)) %>% head(3)
  Species Petal.Length
1  setosa          1.4
2  setosa          1.4
3  setosa          1.3

[Edit - some of the below is now out of date with the release of dplyr 0.7 - see here]

The question is about the difference between standard evaluation and non standard evaluation.

tl;dr: You can use the 'standard evaluation' counterpart of dplyr::select, which is dplyr::select_.

This allows you to provide column names as variables which contain strings:

dplyr::select_(df.main.scaled, names.gens[i,1], names.gens[i,2])

Here is lots more detail that tries to explain how this works:

Non standard evaluation and the select function in dplyr

Non-standard evaluation is the evaluation of code in non-standard ways. Often, this means capturing expressions before they are evaluated, and evaluating them in a different environment (context/scope) to normal. When you provide dplyr::select with column names without quotation marks, dplyr is using non-standard evaluation to interpret them as columns.

Examples of use of dplyr::select

Supposing we have the following data frame:

df <- tibble::tibble(a = 1:5, b = 6:10, c = 11:15, d = 16:20)  # tibble::data_frame() is deprecated

A simple example of the select statement is as follows:

r <- dplyr::select(df, a, b)

This is an example of NSE because a and b are not variables that exist in the global environment. Instead of searching for a and b in the global environment, dplyr::select directs R to search for the variables a and b in the context of the data frame df. You can think of the environment as a bit like a list, with a and b as keys. So the call above is a bit like telling R to look up df$a and df$b.

Function arguments in R are promises which are not evaluated immediately. They can be captured as expressions and then run in a different environment.
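To make the capture-then-evaluate idea concrete, here is a minimal base R sketch. It is not how dplyr itself is implemented; show_nse is a hypothetical helper written only to illustrate the mechanism.

```r
df <- data.frame(a = 1:5, b = 6:10)

# Hypothetical helper: captures its second argument unevaluated,
# then evaluates it with the columns of `data` in scope.
show_nse <- function(data, expr) {
  captured <- substitute(expr)          # the expression as typed, e.g. a + b
  eval(captured, data, parent.frame())  # look names up in `data` first
}

# a and b do not exist as variables here, yet this works
result <- show_nse(df, a + b)
result
```

Anything not found among the columns is looked up in the calling environment, which is why ordinary variables can still be mixed into such expressions.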

This is fine if we know in advance that we want to select the columns a and b. But what if the columns are not known in advance and are held in a variable?

columns_to_select <- c("a", "b")

The following does not work:

dplyr::select(df, columns_to_select)

The error this produces tells us that there is no column called 'columns_to_select' in the data frame. The argument columns_to_select has been evaluated in the context of the data frame, so R has tried to do something like df$columns_to_select and found that the column does not exist.

How do we fix this?

Tidyverse functions always provide an 'escape hatch' that allows you to get around this limitation. The dplyr vignette says 'Every function in dplyr that uses NSE also has a version that uses SE. The name of the SE version is always the NSE name with an _ on the end.'

What does this mean?

We might try the following, but we find it does not work:

# Does not work
r <- dplyr::select_(df, columns_to_select)

Rather than capturing the argument columns_to_select and interpreting it as a column name, select_ evaluates it in the standard way, so it resolves to c("a", "b").

That's what we want, except that each argument to select_ is a single column, and we've just provided a character vector of length two to represent a single column.

The above code therefore returns a tibble with a single column, a, which is not what we wanted. Only the first element of the character vector, "a", is used; everything else is ignored.

One solution to this problem is as follows, but it assumes that columns_to_select contains exactly two elements:

col1 <- columns_to_select[1]
col2 <- columns_to_select[2]
r <- dplyr::select_(df, col1, col2)

How do we generalise this to the case where columns_to_select may have an arbitrary number of elements?

The solution is to use the optional .dots argument.

dplyr::select_(df, .dots = columns_to_select)

This bears some explanation.

In R, the ... construct allows the creation of functions with a variable (arbitrary) number of arguments. The ... is available within the function, and allows the function body to access all of the arguments. See also here.

A very simple example is as follows:

addition <- function(...) {
  args <- list(...)
  sum(unlist(args))
}
r <- addition(1,2,3)

However, this doesn't immediately help us here. It's actually already implemented in the select_ function and merely enables us to provide an arbitrary number of column names as arguments, e.g. select_(df, "a", "b", "c", "d").

What we need is a mechanism that is similar to ..., but allows us to pass something like ... into the function as a single argument. This is exactly what .dots does.
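The difference between ... and a .dots-style argument can be sketched in base R. This is only an analogy and makes no claim about how select_ is implemented; addition_ is a hypothetical variant of the addition function above.

```r
addition <- function(...) {
  args <- list(...)
  sum(unlist(args))
}

# Hypothetical variant: accepts extra values bundled in a single argument
addition_ <- function(..., .dots = list()) {
  args <- c(list(...), .dots)  # merge individual arguments with the bundle
  sum(unlist(args))
}

numbers <- list(1, 2, 3)

r_direct <- addition(1, 2, 3)              # values spelled out one by one
r_spread <- do.call(addition, numbers)     # base R way to spread a list into ...
r_dots   <- addition_(1, .dots = list(2, 3))  # bundle passed as one argument
```

All three calls add up the same three numbers; only the way the arguments reach the function differs.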

Note that .dots is not provided by select, because select is designed to be used interactively.
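For readers on a current dplyr, where the underscore verbs are deprecated, a rough equivalent of the .dots call is to wrap the character vector in all_of(). This is a sketch assuming dplyr 1.0 or later, not part of the original answer's approach.

```r
library(dplyr)

df <- tibble::tibble(a = 1:5, b = 6:10, c = 11:15, d = 16:20)
columns_to_select <- c("a", "b")

# all_of() tells select() to treat the vector's contents as column names
r <- select(df, all_of(columns_to_select))
names(r)
```

This works for a character vector of any length, which is the generalisation that .dots was providing.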