How to deal with spaces in column names?

You asked, "Is there a better general approach to dealing with the problem of spaces (and other characters) in variable names?" Yes, there are a few:

  • Just don't use them; things will break, as you experienced here
  • Use the make.names() function to create safe names; R itself uses it to create identifiers (e.g. by replacing spaces and other invalid characters with dots)
  • If you must, protect the unsafe identifiers with backticks.

Example for the last two points:

R> myvec <- list("foo"=3.14, "some bar"=2.22)
R> myvec$`some bar` * 2
[1] 4.44
R> make.names(names(myvec))
[1] "foo"      "some.bar"
R> 
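As a rough sketch of how make.names() behaves on a few more awkward names (the vector raw below is made up for illustration), note that two different inputs can collapse to the same safe name, which is what the unique = TRUE argument is for:

```r
# Hypothetical column names, as they might arrive from a spreadsheet
raw <- c("foo", "some bar", "some-bar", "2nd value")

# Invalid characters become dots, and a name that starts with a digit gets
# an "X" prefix; unique = TRUE appends a suffix to names that would collide
safe <- make.names(raw, unique = TRUE)
safe
# [1] "foo"        "some.bar"   "some.bar.1" "X2nd.value"
```

The safe names can then be assigned back with names(df) <- safe, where df stands for whatever data frame the names came from.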

This is a "bug" in ggplot2: the internal function quoted_df calls as.data.frame(), which converts the column names to syntactically valid names. Those converted names cannot be found in the original data frame, hence the error.
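The same kind of name conversion can be reproduced outside ggplot2. The sketch below uses plain data.frame(), whose check.names argument defaults to TRUE; it only illustrates the mechanism and is not the ggplot2 internals. The object names checked and kept are made up for the example.

```r
# With the default check.names = TRUE, the name is passed through make.names()
checked <- data.frame("some bar" = 1:2)
names(checked)
# [1] "some.bar"

# With check.names = FALSE, the original name survives
kept <- data.frame("some bar" = 1:2, check.names = FALSE)
names(kept)
# [1] "some bar"

# A lookup by the converted name fails against the original columns
"some.bar" %in% names(kept)
# [1] FALSE
```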

As a reminder:

Syntactically valid names consist of letters, numbers, and the dot or underscore characters, and start with a letter or a dot (a leading dot cannot be followed by a number).
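One way to test a name against this rule, sketched here with a hypothetical helper called is_valid_name, is to check whether make.names() leaves it unchanged. Reserved words such as if are also rejected, because make.names() appends a dot to them.

```r
# Hypothetical helper: a name is valid if make.names() has nothing to fix
is_valid_name <- function(x) x == make.names(x)

candidates <- c("foo", "some bar", ".2x", "_x", "a_b", ".x", "if")
result <- is_valid_name(candidates)
names(result) <- candidates
result
#      foo some bar      .2x       _x      a_b       .x       if
#     TRUE    FALSE    FALSE    FALSE     TRUE     TRUE    FALSE
```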

There's a reason for that restriction: an unquoted name has to parse as an R identifier. That is also why ggplot lets you set the display labels separately with labs(). For example, take the following dummy dataset with valid names:

X <- data.frame(
  PonOAC = rep(c("a", "b", "c", "d"), 2),
  AgeGroup = rep(c("over 80", "under 80"), each = 4),
  NumberofPractices = rpois(8, 70)
)

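You can add labs() at the end to get readable labels on the plot:

library(ggplot2)

# stat = "identity" plots the supplied y values instead of counting rows
ggplot(X, aes(x = PonOAC, y = NumberofPractices, fill = AgeGroup)) +
  geom_bar(stat = "identity") +
  facet_grid(AgeGroup ~ .) +
  labs(x = "% on OAC", y = "Number of Practices", fill = "Age Group")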

This produces a bar chart faceted by age group, with the readable labels on the axes and the legend.
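The same separation of safe names from display labels can be used outside ggplot. The following is a minimal base R sketch, not part of the original answer: the lookup vector display_labels and the fixed counts are assumptions made for illustration.

```r
# Data keeps syntactically valid names (fixed counts instead of rpois()
# so that the example is reproducible)
X <- data.frame(
  PonOAC = rep(c("a", "b", "c", "d"), 2),
  AgeGroup = rep(c("over 80", "under 80"), each = 4),
  NumberofPractices = c(70, 65, 72, 68, 71, 69, 66, 74)
)

# Hypothetical lookup from safe name to human-readable label
display_labels <- c(
  PonOAC = "% on OAC",
  AgeGroup = "Age Group",
  NumberofPractices = "Number of Practices"
)

# Work with the safe names throughout, and swap the labels in only for output
out <- X
names(out) <- unname(display_labels[names(X)])
names(out)
# [1] "% on OAC"            "Age Group"           "Number of Practices"
```

All the computation happens on X with its valid names; only the copy that is printed or exported carries the names with spaces.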


A simple solution for multi-word column names is to separate the words with an underscore character. It has some advantages over other conventions:

  • An underscore in a column name is valid
  • An underscore separates the words for readability
  • Camel case can be tricky to read (consider s vs S and w vs W; similar-looking letters can cause confusion, which is a problem because R is case sensitive)
  • Using a period (.) in a column name is valid but often not ideal for readability, especially for anyone coming from other languages, who may mistake the period for a method call (e.g. data.test could be a column name in R, but to someone used to Ruby or Python it looks like the .test method being called on the object data)
  • Using spaces in column names is valid, but when referencing those columns it is necessary to surround the column name with backticks, i.e. the ` symbol
    • e.g. df$`Sepal Length`, where df is a data frame that has such a column
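To make the last two conventions concrete, here is a small sketch that first reads a column with a space in its name and then renames the columns to use underscores. The data frame df and its values are invented for the example.

```r
# check.names = FALSE keeps the spaces that data.frame() would otherwise replace
df <- data.frame(
  "Sepal Length" = c(5.1, 4.9),
  "Petal Width" = c(0.2, 0.4),
  check.names = FALSE
)

# Backticks are needed with $; a quoted string works with [[ ]]
with_backticks <- df$`Sepal Length`
with_string <- df[["Sepal Length"]]

# Replace every space with an underscore so no quoting is needed afterwards
names(df) <- gsub(" ", "_", names(df), fixed = TRUE)
names(df)
# [1] "Sepal_Length" "Petal_Width"
df$Sepal_Length
# [1] 5.1 4.9
```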

TL;DR Use the underscore to separate words in column names and you shouldn't have any problems. Avoid spaces in column names, and if your data already has some, surround the full column name with backticks ` when referring to it in functions.