How does one change the levels of a factor column in a data.table
What is the correct way to change the levels of a factor column in a data.table (note: not a data frame)?
library(data.table)
mydt <- data.table(id=1:6, value=as.factor(c("A", "A", "B", "B", "B", "C")), key="id")
mydt[, levels(value)]
[1] "A" "B" "C"
I am looking for something like:
mydt[, levels(value) <- c("X", "Y", "Z")]
But of course, the above line does not work.
# Actual            # Expected result
> mydt              > mydt
   id value            id value
1:  1     A         1:  1     X
2:  2     A         2:  2     X
3:  3     B         3:  3     Y
4:  4     B         4:  4     Y
5:  5     B         5:  5     Y
6:  6     C         6:  6     Z
You can still set them the traditional way:
levels(mydt$value) <- c(...)
This should be plenty fast unless mydt is very large, since that traditional syntax copies the entire object. You could also play the un-factoring and refactoring game... but no one likes that game anyway.
To change the levels by reference with no copy of mydt:
setattr(mydt$value,"levels",c(...))
but be sure to assign a valid levels vector (type character, of sufficient length), otherwise you'll end up with an invalid factor (levels<- does some checking as well as copying).
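Here is a minimal sketch of the by-reference approach, with the checking done by hand since setattr skips it. The new_levels name and the particular checks are my own assumptions about what counts as "valid" here, not something data.table provides:

```r
library(data.table)

mydt <- data.table(id = 1:6,
                   value = as.factor(c("A", "A", "B", "B", "B", "C")),
                   key = "id")

# Hypothetical replacement labels, one per existing level, in level order
new_levels <- c("X", "Y", "Z")

# Manual sanity checks (an assumption about what "valid" means here):
# character type, no duplicates, and at least as many labels as existing levels
stopifnot(is.character(new_levels),
          !anyDuplicated(new_levels),
          length(new_levels) >= nlevels(mydt$value))

# Modify the "levels" attribute of the column in place, without copying mydt
setattr(mydt$value, "levels", new_levels)

levels(mydt$value)   # "X" "Y" "Z"
```

The underlying integer codes are untouched, so rows that held the first level now show the first new label, and so on.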
I would rather go the traditional way of re-assignment to the factors:
> mydt$value # This is what we had originally
[1] A A B B B C
Levels: A B C
> levels(mydt$value) # just checking the levels
[1] "A" "B" "C"
# Meat of the re-assignment
> levels(mydt$value)[levels(mydt$value)=="A"] <- "X"
> levels(mydt$value)[levels(mydt$value)=="B"] <- "Y"
> levels(mydt$value)[levels(mydt$value)=="C"] <- "Z"
> levels(mydt$value)
[1] "X" "Y" "Z"
> mydt # This is what we wanted
id value
1: 1 X
2: 2 X
3: 3 Y
4: 4 Y
5: 5 Y
6: 6 Z
As you probably noticed, the meat of the re-assignment is very intuitive: it checks for the exact level (use grepl if you need a fuzzy match, regular expressions or the like)
levels(mydt$value)[levels(mydt$value)=="A"] <- "X"
This explicitly checks the value in the levels of the variable under consideration and then reassigns X (and so on) to it. The advantage: you explicitly know which label replaced which level.
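I find renaming levels as in levels(mydt$value) <- c("X","Y","Z") very non-intuitive, since it just assigns X to the first entry of levels(mydt$value), whatever that happens to be (by default the levels are sorted, not kept in the order they appear in the data), so the order really matters.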
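To make the ordering point concrete, here is a small sketch with made-up values (the factor f is purely illustrative). Positional renaming follows the level order, which factor() sorts by default, not the order in which values appear:

```r
# Illustrative data: values appear as b, c, a
f <- factor(c("b", "c", "a"))
levels(f)            # "a" "b" "c"  (sorted, not order of appearance)

# Positional renaming: first level gets "X", second "Y", third "Z"
levels(f) <- c("X", "Y", "Z")
as.character(f)      # "Y" "Z" "X"
```

So the first element of the data ends up labelled Y, not X, which is easy to get wrong when the replacement vector is written from memory.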
PS: With many levels, use a looping construct instead of writing one line per level.
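A rough sketch of what such a loop could look like, using a plain factor and a hypothetical lookup vector called mapping (old level as name, new label as value). The grepl line at the end is the pattern-based variant mentioned earlier:

```r
f <- factor(c("A", "A", "B", "B", "B", "C"))

# Hypothetical lookup table: names are the old levels, values the new labels
mapping <- c(A = "X", B = "Y", C = "Z")

# Rename one level per iteration, matching on the exact old name
for (old in names(mapping)) {
  levels(f)[levels(f) == old] <- mapping[[old]]
}
levels(f)            # "X" "Y" "Z"

# Pattern-based variant: every level matching the regex gets the same label,
# and levels that end up with identical labels are merged
levels(f)[grepl("^[XY]", levels(f))] <- "XY"
levels(f)            # "XY" "Z"
```

This assumes no new label collides with an old level that is still waiting to be renamed; if that can happen, build the full replacement vector first and assign it in one step.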
You can also rename and add to your levels using a related approach, which can be very handy, especially if you are making a plot that needs more informative labels in a particular order (as opposed to the default):
f <- factor(c("a","b"))
levels(f) <- list(C = "C", D = "a", B = "b")
(modified from ?levels)
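For reference, a short sketch of what the list form does, with illustrative values of my own. Each list name becomes a new level, in the order given, and the character vector on the right says which old levels map to it, so several old levels can be merged at once:

```r
f <- factor(c("a", "b", "c", "b"))

# New level "Other" absorbs old levels "a" and "c"; "b" is renamed to "B".
# The order of the list sets the order of the new levels.
levels(f) <- list(Other = c("a", "c"), B = "b")

levels(f)            # "Other" "B"
as.character(f)      # "Other" "B" "Other" "B"
```

A name whose right-hand side matches no existing level (like C = "C" above) simply creates an empty level, which is what lets you reserve a slot in the plotting order.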