Factors in R: more than an annoyance?

Solution 1:

You should use factors. Yes, they can be a pain, but my theory is that 90% of the pain comes from read.table and read.csv, where the argument stringsAsFactors was TRUE by default before R 4.0.0 (and most users missed this subtlety). I say they are useful because model-fitting packages like lme4 use factors and ordered factors to fit models differently and to determine the type of contrasts to use. Graphing packages also use them to group by. ggplot and most model-fitting functions coerce character vectors to factors, so the result is the same. However, depending on your R version, you may end up with warnings in your code:

lm(Petal.Length ~ -1 + Species, data=iris)

# Call:
# lm(formula = Petal.Length ~ -1 + Species, data = iris)

# Coefficients:
#     Speciessetosa  Speciesversicolor   Speciesvirginica  
#             1.462              4.260              5.552  

iris.alt <- iris
iris.alt$Species <- as.character(iris.alt$Species)
lm(Petal.Length ~ -1 + Species, data=iris.alt)

# Call:
# lm(formula = Petal.Length ~ -1 + Species, data = iris.alt)

# Coefficients:
#     Speciessetosa  Speciesversicolor   Speciesvirginica  
#             1.462              4.260              5.552  

# Warning message:
# In model.matrix.default(mt, mf, contrasts) :
#   variable Species converted to a factor
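
To make the stringsAsFactors point from the top of this answer concrete, here is a minimal sketch that sets the argument explicitly in both directions. The inline CSV text and the names csv, as_chr and as_fct are made up for illustration; setting the argument yourself means the result does not depend on which default your R version uses.

```r
# Hypothetical inline CSV, used only for illustration
csv <- "id,fruit\n1,apple\n2,pear\n3,apple"

# Same data, read once as character and once as factor
as_chr <- read.csv(text = csv, stringsAsFactors = FALSE)
as_fct <- read.csv(text = csv, stringsAsFactors = TRUE)

class(as_chr$fruit)   # "character"
class(as_fct$fruit)   # "factor"
levels(as_fct$fruit)  # "apple" "pear"
```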

One tricky thing is the whole drop=TRUE bit. In vectors this works well to remove levels of factors that aren't in the data. For example:

s <- iris$Species
s[s == 'setosa', drop=TRUE]
#  [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [11] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [21] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [31] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [41] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# Levels: setosa
s[s == 'setosa', drop=FALSE]
#  [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [11] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [21] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [31] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [41] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# Levels: setosa versicolor virginica

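However, with data.frames, the behavior of [.data.frame() is different: there, drop controls whether the result is coerced to the lowest possible dimension, not whether unused factor levels are removed (see this email or ?"[.data.frame"). Using drop=TRUE on data.frames does not work as you'd imagine: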

x <- subset(iris, Species == 'setosa', drop=TRUE)  # subsetting with [ behaves the same way
x$Species
#  [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [11] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [21] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [31] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [41] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# Levels: setosa versicolor virginica

Luckily, droplevels() makes it easy to drop unused factor levels, either for an individual factor or for every factor in a data.frame (since R 2.12):

x <- subset(iris, Species == 'setosa')
levels(x$Species)
# [1] "setosa"     "versicolor" "virginica" 
x <- droplevels(x)
levels(x$Species)
# [1] "setosa"

This is how to keep levels you have filtered out from showing up in ggplot legends.

Internally, factors are integer vectors with a levels attribute that holds a character vector (see attributes(iris$Species) and class(attributes(iris$Species)$levels)), which is clean. If you had to change a level name while using character strings, it would be a much less efficient operation, because every matching element would have to be rewritten. And I change level names a lot, especially for ggplot legends. If you fake factors with character vectors, there's the risk that you'll change just one element and accidentally create a separate new level.
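
A small sketch of the points above, using a made-up vector (f, g and ch are illustrative names, not from the question). It shows the integer codes, a rename that only touches the levels attribute, and how a typo behaves differently in a character vector and in a factor.

```r
f <- factor(c("lo", "hi", "lo", "mid"))

typeof(f)       # "integer"
as.integer(f)   # 2 1 2 3  (codes index into the sorted levels)
levels(f)       # "hi" "lo" "mid"

# Renaming a level edits the levels attribute only; the codes are untouched
levels(f)[levels(f) == "lo"] <- "low"
as.character(f) # "low" "hi" "low" "mid"

# A typo in a character vector silently creates a new "level" ...
ch <- c("lo", "hi", "lo", "mid")
ch[1] <- "lwo"
unique(ch)      # "lwo" "hi" "lo" "mid"

# ... whereas a factor rejects it: the element becomes NA and R warns
# "invalid factor level, NA generated" (suppressed here to keep output quiet)
g <- f
suppressWarnings(g[1] <- "lwo")
is.na(g[1])     # TRUE
```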

Solution 2:

Ordered factors are awesome. If I happen to love oranges and hate apples but don't mind grapes, I don't need to manage some weird index to say so:

d <- data.frame(x = rnorm(20), f = sample(c("apples", "oranges", "grapes"), 20, replace = TRUE, prob = c(0.5, 0.25, 0.25)))
d$f <- ordered(d$f, c("apples", "grapes", "oranges"))
d[d$f >= "grapes", ]
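
Because the data frame above is random, here is a deterministic sketch of the same comparison on a fixed vector (pref is a made-up name). It only illustrates that comparison operators follow the declared level order rather than alphabetical order.

```r
pref <- ordered(c("apples", "oranges", "grapes", "apples"),
                levels = c("apples", "grapes", "oranges"))

pref >= "grapes"          # FALSE TRUE TRUE FALSE
as.character(max(pref))   # "oranges"
as.character(pref[pref >= "grapes"])  # "oranges" "grapes"
```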

Solution 3:

A factor is most analogous to an enumerated type in other languages. Its appropriate use is for a variable that can only take on one of a prescribed set of values. In these cases, not every allowed value may be present in any particular set of data, and the "empty" levels accurately reflect that.

Consider some examples. For some data which was collected all across the United States, the state should be recorded as a factor. In this case, the fact that no cases were collected from a particular state is relevant. There could have been data from that state, but there happened (for whatever reason, which may be a reason of interest) to not be. If hometown was collected, it would not be a factor. There is not a pre-stated set of possible hometowns. If data were collected from three towns rather than nationally, the town would be a factor: there are three choices that were given at the outset and if no relevant cases/data were found in one of those three towns, that is relevant.
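
The three-town case can be sketched as follows. The town names and the vectors towns and seen are hypothetical; the point is only that a factor keeps the town with no observations in the tabulation, while a character vector loses it.

```r
# Hypothetical design: three towns fixed at the outset
towns <- c("Ames", "Boone", "Nevada")

# Observations happened to come from only two of them
seen <- factor(c("Ames", "Boone", "Ames"), levels = towns)

table(seen)                # Nevada is reported with a count of 0
table(as.character(seen))  # Nevada is absent altogether
```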

Other aspects of factors, such as providing a way to give an arbitrary sort order to a set of strings, are useful secondary characteristics of factors, but are not the reason for their existence.
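
For completeness, a brief sketch of that secondary use, with made-up size labels (sizes and f are illustrative names): the level order, not the alphabet, decides how the values sort.

```r
sizes <- c("large", "small", "medium", "small")

sort(sizes)   # alphabetical: "large" "medium" "small" "small"

# Declare the intended order through the levels
f <- factor(sizes, levels = c("small", "medium", "large"))
as.character(sort(f))   # "small" "small" "medium" "large"
```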