Select columns based on string match - dplyr::select
I have a data frame ("data") with lots and lots of columns. Some of the columns contain a certain string ("search_string").
How can I use dplyr::select() to give me a subset including only the columns that contain the string?
I tried:
# columns as boolean vector
select(data, grepl("search_string",colnames(data)))
# columns as vector of column names
select(data, colnames(data)[grepl("search_string",colnames(data))])
Neither of them works.
I know that select() accepts numeric vectors as a substitute for columns, e.g.:
select(data,5,7,9:20)
But I don't know how to get a numeric vector of column IDs from my grepl() expression.
Solution 1:
Within the dplyr world, try:
select(iris,contains("Sepal"))
See the Selection section in ?select for numerous other helpers like starts_with, ends_with, etc.
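As a small sketch of those helpers on the built-in iris data (the variable names here are only illustrative):

```r
library(dplyr)

# Columns whose names start with "Sepal"
sepal_cols <- select(iris, starts_with("Sepal"))
names(sepal_cols)

# Columns whose names end with "Width"
width_cols <- select(iris, ends_with("Width"))
names(width_cols)

# contains() does a literal substring match and ignores case by default
petal_cols <- select(iris, contains("petal"))
names(petal_cols)
```

If case matters, each of these helpers takes an ignore.case argument that can be set to FALSE.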
Solution 2:
You can try:
select(data, matches("search_string"))
It is more general than contains - you can use regex (e.g. "one_string|or_the_other").
For more examples, see: http://rpackages.ianhowson.com/cran/dplyr/man/select.html.
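To illustrate the difference, here is a short sketch on iris (variable names are just for the example). contains() treats its argument as a literal string, while matches() treats it as a regular expression:

```r
library(dplyr)

# Regex with anchors and alternation: names starting with "Sepal"
# or ending with "Width"
regex_cols <- select(iris, matches("^Sepal|Width$"))
names(regex_cols)

# "." is a literal dot for contains(), but "any character" for matches()
literal_dot <- select(iris, contains("."))
any_char    <- select(iris, matches("."))
ncol(literal_dot)  # Species has no dot in its name, so it is left out
ncol(any_char)     # every column name has at least one character
```

So if the search string may contain regex metacharacters that should be taken literally, contains() is probably the safer choice.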
Solution 3:
No need to use select, just use [ instead:
data[,grepl("search_string", colnames(data))]
Let's try with the iris dataset:
> head(iris[, grepl("Sepal", colnames(iris))])
Sepal.Length Sepal.Width
1 5.1 3.5
2 4.9 3.0
3 4.7 3.2
4 4.6 3.1
5 5.0 3.6
6 5.4 3.9
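Two related points, sketched on iris (variable names are illustrative). First, grep() returns the integer positions that the question asked about, where grepl() returns a logical vector. Second, when only one column matches, [ drops to a plain vector unless drop = FALSE is given:

```r
# grep() gives integer column positions instead of TRUE/FALSE
idx <- grep("Sepal", colnames(iris))
idx

# Those positions can be used directly for subsetting
sepal_df <- iris[, idx]

# With a single matching column, [ returns a vector by default ...
one_vec <- iris[, grepl("Species", colnames(iris))]
class(one_vec)

# ... so pass drop = FALSE to keep a data frame
one_df <- iris[, grepl("Species", colnames(iris)), drop = FALSE]
class(one_df)
```

Note that this drop behaviour applies to plain data frames; tibbles keep their data frame shape when subset with [.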
Solution 4:
Based on Piotr Migdal's response, I want to give an alternative solution that works with a vector of strings:
myVectorOfStrings <- c("foo", "bar")
matchExpression <- paste(myVectorOfStrings, collapse = "|")
# [1] "foo|bar"
df %>% select(matches(matchExpression))
This makes use of the regex OR operator (|).
ATTENTION: If you really have a plain vector of column names (and do not need the power of regular expressions), please see the comment below this answer (since it's the cleaner solution).
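For the plain-vector case, one option in current dplyr (via tidyselect 1.0.0 or later) is any_of() or all_of(). This is a sketch of one possible approach and not necessarily what the comment mentioned above suggests; the vector wanted is made up for the example:

```r
library(dplyr)

# Hypothetical vector of exact column names; one of them does not exist
wanted <- c("Sepal.Length", "Species", "not_a_column")

# any_of() selects the names that exist and silently skips the rest
picked <- select(iris, any_of(wanted))
names(picked)

# all_of() is strict: it raises an error if any name is missing
strict_result <- tryCatch(
  select(iris, all_of(wanted)),
  error = function(e) "error: missing column"
)
```

No regular expression is involved here, so names containing characters like "." or "|" are matched exactly.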