An NA in subsetting a data.frame does something unexpected

Consider the following code. When you don't explicitly test for NA in your condition, that code will fail at some later date then your data changes.

>   # A toy example
>   a <- as.data.frame(cbind(col1=c(1,2,3,4),col2=c(2,NA,2,3),col3=c(1,2,3,4),col4=c(4,3,2,1)))
>   a
  col1 col2 col3 col4
1    1    2    1    4
2    2   NA    2    3
3    3    2    3    2
4    4    3    4    1
>   
>   # Bummer, there's an NA in my condition
>   a$col2==2
[1]  TRUE    NA  TRUE FALSE
> 
>   # Why is this a good thing to do?
>   # It NA'd the whole row, and kept it
>   a[a$col2==2,]
   col1 col2 col3 col4
1     1    2    1    4
NA   NA   NA   NA   NA
3     3    2    3    2
>   
>   # Yes, this is the right way to do it
>   a[!is.na(a$col2) & a$col2==2,]
  col1 col2 col3 col4
1    1    2    1    4
3    3    2    3    2
>     
>   # Subset seems designed to avoid this problem
>   subset(a, col2 == 2)
  col1 col2 col3 col4
1    1    2    1    4
3    3    2    3    2

Can someone explain why the behavior you get without the is.na check would ever be good or useful?


Solution 1:

I definitely agree that this isn't intuitive (I made that point before on SO). In defense of R, I think that knowing when you have a missing value is useful (i.e. this is not a bug). The == operator is explicitly designed to notify the user of NA or NaN values. See ?"==" for more information. It states:

Missing values ('NA') and 'NaN' values are regarded as non-comparable even to themselves, so comparisons involving them will always result in 'NA'.

In other words, a missing value isn't comparable using a binary operator (because it's unknown).

Beyond is.na(), you could also do:

which(a$col2==2) # tests explicitly for TRUE

Or

a$col2 %in% 2 # only checks for 2

%in% is defined as using the match() function:

'"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0'

This is also covered in "The R Inferno".

Checking for NA values in your data is crucial in R, because many important operators don't handle it the way you expect. Beyond ==, this is also true for things like &, |, <, sum(), and so on. I am always thinking "what would happen if there was an NA here" when I'm writing R code. Requiring an R user to be careful with missing values is "by design".

Update: How is NA handled when there are multiple logical conditions?

NA is a logical constant and you might get unexpected subsetting if you don't think about what might be returned (e.g. NA | TRUE == TRUE). These truth tables from ?Logic may provide a useful illustration:

outer(x, x, "&") ## AND table
#       <NA> FALSE  TRUE
#<NA>     NA FALSE    NA
#FALSE FALSE FALSE FALSE
#TRUE     NA FALSE  TRUE

outer(x, x, "|") ## OR  table
#      <NA> FALSE TRUE
#<NA>    NA    NA TRUE
#FALSE   NA FALSE TRUE
#TRUE  TRUE  TRUE TRUE