Replace <NA> in a factor column
I want to replace <NA> values in a factor column with a valid value, but I cannot find a way. This example is only for demonstration; the original data comes from a foreign CSV file that I have to deal with.
df <- data.frame(a=sample(0:10, size=10, replace=TRUE),
b=sample(20:30, size=10, replace=TRUE))
df[df$a==0,'a'] <- NA
df$a <- as.factor(df$a)
The result could look like this:
a b
1 1 29
2 2 23
3 3 23
4 3 22
5 4 28
6 <NA> 24
7 2 21
8 4 25
9 <NA> 29
10 3 24
Now I want to replace the <NA> values with a number.
df[is.na(df$a), 'a'] <- 88
In `[<-.factor`(`*tmp*`, iseq, value = c(88, 88)) :
invalid factor level, NA generated
I think I missed a fundamental R concept about factors. Am I right?
I cannot understand why it doesn't work. I think "invalid factor level" means that 88 is not a valid level in that factor, right? So do I have to tell the factor column that there is another level?
Solution 1:
1) addNA If fac is a factor, addNA(fac) is the same factor but with NA added as a level. See ?addNA
To force the NA level to be 88:
facna <- addNA(fac)
levels(facna) <- c(levels(fac), 88)
giving:
> facna
[1] 1 2 3 3 4 88 2 4 88 3
Levels: 1 2 3 4 88
1a) This can be written in a single line as follows:
`levels<-`(addNA(fac), c(levels(fac), 88))
2) factor It can also be done in one line using the various arguments of factor like this:
factor(fac, levels = levels(addNA(fac)), labels = c(levels(fac), 88), exclude = NULL)
2a) or equivalently:
factor(fac, levels = c(levels(fac), NA), labels = c(levels(fac), 88), exclude = NULL)
3) ifelse Another approach is:
factor(ifelse(is.na(fac), 88, paste(fac)), levels = c(levels(fac), 88))
4) forcats The forcats package has a function for this (as of forcats 1.0.0, fct_explicit_na is deprecated in favor of fct_na_value_to_level):
library(forcats)
fct_explicit_na(fac, "88")
## [1] 1 2 3 3 4 88 2 4 88 3
## Levels: 1 2 3 4 88
Note: We used the following for input fac
fac <- structure(c(1L, 2L, 3L, 3L, 4L, NA, 2L, 4L, NA, 3L), .Label = c("1",
"2", "3", "4"), class = "factor")
Update: Have improved (1) and added (1a). Later added (4).
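As a quick sanity check, here is a minimal sketch that rebuilds the input with factor() (an assumption made for readability; it should yield the same factor as the structure() call in the note) and compares approaches (1), (2a) and (3). The names res1, res2a and res3 are introduced only for this illustration.

```r
fac <- factor(c(1, 2, 3, 3, 4, NA, 2, 4, NA, 3))

# (1) add NA as an explicit level, then relabel that level as "88"
res1 <- addNA(fac)
levels(res1) <- c(levels(fac), 88)

# (2a) map the NA level to the label "88" in a single factor() call
res2a <- factor(fac, levels = c(levels(fac), NA),
                labels = c(levels(fac), 88), exclude = NULL)

# (3) substitute 88 on the character representation, then rebuild the factor
res3 <- factor(ifelse(is.na(fac), 88, paste(fac)),
               levels = c(levels(fac), 88))

# All three should carry the same values and the same levels
all_agree <- identical(as.character(res1), as.character(res2a)) &&
  identical(as.character(res1), as.character(res3)) &&
  identical(levels(res1), levels(res2a)) &&
  identical(levels(res1), levels(res3))
print(all_agree)
```

Comparing as.character() and levels() separately avoids depending on how each approach happens to store its attributes.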
Solution 2:
Another way to do it is:
#check levels
levels(df$a)
#[1] "3" "4" "7" "9" "10"
#add a new factor level, i.e. 88 in our example
df$a = factor(df$a, levels=c(levels(df$a), 88))
#convert all NA's to 88
df$a[is.na(df$a)] = 88
#check levels again
levels(df$a)
#[1] "3" "4" "7" "9" "10" "88"
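The two steps above (register the level, then assign it) can be wrapped in a small helper. This is only a sketch: replace_na_level is a hypothetical name, not a base R function, and the sample vector is made up to match the levels shown above.

```r
# Hypothetical helper: replace NA entries of a factor with `value`,
# registering `value` as a level first if it is not one already.
replace_na_level <- function(x, value) {
  stopifnot(is.factor(x))
  value <- as.character(value)
  if (!value %in% levels(x)) {
    # Appending to levels() leaves the existing integer codes untouched
    levels(x) <- c(levels(x), value)
  }
  x[is.na(x)] <- value
  x
}

a <- factor(c(3, 4, NA, 7, 9, NA, 10))
a <- replace_na_level(a, 88)
levels(a)
#[1] "3"  "4"  "7"  "9"  "10" "88"
```

Because the level is added before the assignment, the "invalid factor level" warning from the question should not occur.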
Solution 3:
I had similar issues and I want to add what I consider the most pragmatic (and also tidy) solution:
Convert the column to a character column, use mutate and a simple ifelse statement to change the NA values to what you want the factor level to be (I have chosen "None"), then convert it back to a factor column:
library(dplyr)
df %>% mutate(
  a = as.character(a),
  a = ifelse(is.na(a), "None", a),
  a = as.factor(a)
)
Clean and painless because you do not actually have to dabble with NA values when they occur in a factor column. You bypass the weirdness and end up with a clean factor variable.
Also, in response to the comment made below regarding multiple columns: You can wrap the statements in a function and use mutate_if to select all factor variables or, if you know the names of the columns of concern, mutate_at to apply that function:
replace_factor_na <- function(x){
  x <- as.character(x)
  x <- if_else(is.na(x), "None", x)
  # return the factor visibly instead of ending on an assignment
  as.factor(x)
}
df <- df %>%
  mutate_if(is.factor, replace_factor_na)
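For readers without dplyr, the same multi-column idea can be sketched in base R. This swaps mutate_if for vapply and lapply; the helper na_to_level and the small data frame are hypothetical and exist only for this illustration.

```r
# Hypothetical base-R counterpart of the helper above
na_to_level <- function(x, value = "None") {
  x <- as.character(x)
  x[is.na(x)] <- value
  factor(x)
}

df <- data.frame(
  a = factor(c("1", NA, "3")),
  b = c(20, 21, 22),
  g = factor(c(NA, "x", "y"))
)

# Find the factor columns, then rewrite only those
is_fac <- vapply(df, is.factor, logical(1))
df[is_fac] <- lapply(df[is_fac], na_to_level)

str(df)
```

Non-factor columns such as b are left untouched because they are never passed to the helper.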
Solution 4:
The basic concept of a factor variable is that it can only take specific values, i.e., the levels. A value not in the levels is invalid.
You have two possibilities:
If you have a variable that follows this concept, make sure to define all levels when you create it, even those without corresponding values.
Or make the variable a character variable and work with that.
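Both possibilities can be sketched on a small made-up vector (x, f_all and ch are illustrative names, and 88 is the replacement value from the question):

```r
x <- c(1, 2, NA, 3)

# A plain factor only knows the values it has seen
f <- factor(x)
levels(f)

# Possibility 1: declare every level up front, including one not yet used
f_all <- factor(x, levels = c(1, 2, 3, 88))
f_all[is.na(f_all)] <- 88   # valid, because "88" is a declared level
table(f_all)

# Possibility 2: work with a character vector, which accepts any value
ch <- as.character(x)
ch[is.na(ch)] <- "88"
ch
```

In the first case the assignment succeeds only because the level exists beforehand; in the second there is no level set to violate.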
PS: Often these problems result from data import. For instance, what you show there looks like it should be a numeric variable and not a factor variable.
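If the column really is meant to be numeric, one option is to convert it back before replacing the missing values. The sketch below uses made-up data; note that as.numeric() applied directly to a factor returns the internal level codes rather than the numbers shown in the labels, so the conversion goes through as.character() first.

```r
f <- factor(c("3", "10", NA, "7"))

codes  <- as.numeric(f)                 # internal level codes, not the data
values <- as.numeric(as.character(f))   # the numbers the labels spell out

values[is.na(values)] <- 88
values
#[1]  3 10 88  7
```

Once the column is numeric, the replacement is an ordinary assignment and no level bookkeeping is needed.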