Fastest way to detect if vector has at least 1 NA?
What is the fastest way to detect if a vector has at least 1 NA
in R? I've been using:
sum( is.na( data ) ) > 0
But that requires examining every element, coercing the logical result to integer, and calling sum().
Solution 1:
I'm thinking:
any(is.na(data))
should be slightly faster.
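As a quick sanity check that the two forms give the same answer, here is a minimal sketch. The sample vectors are made up for illustration and are not from the question:

```r
# Hypothetical sample data, not from the question
data  <- c(3.2, NA, 1.5, NA, 7)
clean <- c(1, 2, 3)

sum(is.na(data)) > 0   # sums the logical vector, then compares with 0
any(is.na(data))       # asks directly whether any element is TRUE

# Both forms agree on vectors with and without NAs
stopifnot(identical(sum(is.na(data)) > 0, any(is.na(data))))
stopifnot(identical(sum(is.na(clean)) > 0, any(is.na(clean))))
```

Note that both still build the full is.na() result first, so any speed difference comes only from the final reduction step.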
Solution 2:
As of R 3.1.0, anyNA() is the way to do this. On atomic vectors this will stop after the first NA instead of going through the entire vector, as would be the case with any(is.na()). Additionally, this avoids creating an intermediate logical vector with is.na() that is immediately discarded. Borrowing Joran's example:
x <- y <- runif(1e7)
x[1e4] <- NA
y[1e7] <- NA
microbenchmark::microbenchmark(any(is.na(x)), anyNA(x), any(is.na(y)), anyNA(y), times=10)
# Unit: microseconds
#           expr        min         lq        mean      median         uq
#  any(is.na(x))  13444.674  13509.454  21191.9025  13639.3065  13917.592
#       anyNA(x)      6.840     13.187     13.5283     14.1705     14.774
#  any(is.na(y)) 165030.942 168258.159 178954.6499 169966.1440 197591.168
#       anyNA(y)   7193.784   7285.107   7694.1785   7497.9265   7865.064
Notice how anyNA() is substantially faster even when the NA is the last value of the vector; this is in part because it avoids the intermediate logical vector.
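If the code also has to run on R older than 3.1.0, one option is a small wrapper that falls back to any(is.na()). This is only a sketch; has_na() is a hypothetical helper name, not something from base R:

```r
# has_na() is a hypothetical helper name, not part of base R
has_na <- function(x) {
  if (exists("anyNA", mode = "function")) {
    anyNA(x)        # available from R 3.1.0 onwards
  } else {
    any(is.na(x))   # fallback for older R versions
  }
}

has_na(c(1, NA, 3))
has_na(c(1, 2, 3))
```

On a current R installation this just dispatches to anyNA(), so it should cost next to nothing.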
Solution 3:
We mention this in some of our Rcpp presentations and actually have some benchmarks which show a pretty large gain from embedded C++ with Rcpp over the R solution, because a vectorised R solution still computes every single element of the vector expression. If your goal is just to satisfy any(), then you can abort after the first match, which is what our Rcpp sugar solution does. (Rcpp sugar is, in essence, some C++ template magic to make C++ expressions look more like R expressions; see this vignette for more.)
So by getting a compiled specialised solution to work, we do indeed get a fast solution. I should add that while I have not compared this to the solutions offered in this SO question here, I am reasonably confident about the performance.
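For illustration, a sugar-based NA check could look roughly like the sketch below. It assumes the Rcpp package and a working C++ toolchain are installed; any_na_sugar() is a hypothetical name, and this is not the benchmark code from the presentations:

```r
# Assumes Rcpp and a C++ compiler are available
library(Rcpp)

# any_na_sugar() is a hypothetical name chosen for this sketch.
# cppFunction() compiles the C++ snippet and exposes it as an R function.
cppFunction('
bool any_na_sugar(NumericVector x) {
  // is_na(x) is a sugar expression, any() reduces it,
  // and is_true() turns the result into a plain bool
  return is_true(any(is_na(x)));
}')

any_na_sugar(c(1, 2, NA, 4))
any_na_sugar(c(1, 2, 3, 4))
```

I have not timed this against anyNA(), so treat it as a demonstration of the sugar syntax rather than a performance claim.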
Edit: The Rcpp package contains examples in the directory sugarPerformance. They show a speed-up by a factor of several thousand for 'sugar-can-abort-soon' over 'R-computes-full-vector-expression' for any(), but I should add that that case does not involve is.na() but a simple boolean expression.