Cleaning `Inf` values from an R dataframe
In R, I have an operation which creates some Inf
values when I transform a dataframe.
I would like to turn these Inf
values into NA
values. The code I have is slow for large data, is there a faster way of doing this?
Say I have the following dataframe:
dat <- data.frame(a=c(1, Inf), b=c(Inf, 3), d=c("a","b"))
The following works in a single case:
dat[,1][is.infinite(dat[,1])] = NA
So I generalized it with following loop
cf_DFinf2NA <- function(x)
{
for (i in 1:ncol(x)){
x[,i][is.infinite(x[,i])] = NA
}
return(x)
}
But I don't think that this is really using the power of R.
Solution 1:
Option 1
Use the fact that a data.frame
is a list of columns, then use do.call
to recreate a data.frame
.
do.call(data.frame,lapply(DT, function(x) replace(x, is.infinite(x),NA)))
Option 2 -- data.table
You could use data.table
and set
. This avoids some internal copying.
DT <- data.table(dat)
invisible(lapply(names(DT),function(.name) set(DT, which(is.infinite(DT[[.name]])), j = .name,value =NA)))
Or using column numbers (possibly faster if there are a lot of columns):
for (j in 1:ncol(DT)) set(DT, which(is.infinite(DT[[j]])), j, NA)
Timings
# some `big(ish)` data
dat <- data.frame(a = rep(c(1,Inf), 1e6), b = rep(c(Inf,2), 1e6),
c = rep(c('a','b'),1e6),d = rep(c(1,Inf), 1e6),
e = rep(c(Inf,2), 1e6))
# create data.table
library(data.table)
DT <- data.table(dat)
# replace (@mnel)
system.time(na_dat <- do.call(data.frame,lapply(dat, function(x) replace(x, is.infinite(x),NA))))
## user system elapsed
# 0.52 0.01 0.53
# is.na (@dwin)
system.time(is.na(dat) <- sapply(dat, is.infinite))
# user system elapsed
# 32.96 0.07 33.12
# modified is.na
system.time(is.na(dat) <- do.call(cbind,lapply(dat, is.infinite)))
# user system elapsed
# 1.22 0.38 1.60
# data.table (@mnel)
system.time(invisible(lapply(names(DT),function(.name) set(DT, which(is.infinite(DT[[.name]])), j = .name,value =NA))))
# user system elapsed
# 0.29 0.02 0.31
data.table
is the quickest. Using sapply
slows things down noticeably.
Solution 2:
Use sapply
and is.na<-
> dat <- data.frame(a=c(1, Inf), b=c(Inf, 3), d=c("a","b"))
> is.na(dat) <- sapply(dat, is.infinite)
> dat
a b d
1 1 NA a
2 NA 3 b
Or you can use (giving credit to @mnel, whose edit this is),
> is.na(dat) <- do.call(cbind,lapply(dat, is.infinite))
which is significantly faster.
Solution 3:
Here is a dplyr/tidyverse solution using the na_if() function:
dat %>% mutate_if(is.numeric, list(~na_if(., Inf)))
Note that this only replaces positive infinity with NA. Need to repeat if negative infinity values also need to be replaced.
dat %>% mutate_if(is.numeric, list(~na_if(., Inf))) %>%
mutate_if(is.numeric, list(~na_if(., -Inf)))
Solution 4:
[<-
with mapply
is a bit faster than sapply
.
> dat[mapply(is.infinite, dat)] <- NA
With mnel's data, the timing is
> system.time(dat[mapply(is.infinite, dat)] <- NA)
# user system elapsed
# 15.281 0.000 13.750
Solution 5:
There is very simple solution to this problem in the hablar package:
library(hablar)
dat %>% rationalize()
Which return a data frame with all Inf are converted to NA.
Timings compared to some above solutions. Code: library(hablar) library(data.table)
dat <- data.frame(a = rep(c(1,Inf), 1e6), b = rep(c(Inf,2), 1e6),
c = rep(c('a','b'),1e6),d = rep(c(1,Inf), 1e6),
e = rep(c(Inf,2), 1e6))
DT <- data.table(dat)
system.time(dat[mapply(is.infinite, dat)] <- NA)
system.time(dat[dat==Inf] <- NA)
system.time(invisible(lapply(names(DT),function(.name) set(DT, which(is.infinite(DT[[.name]])), j = .name,value =NA))))
system.time(rationalize(dat))
Result:
> system.time(dat[mapply(is.infinite, dat)] <- NA)
user system elapsed
0.125 0.039 0.164
> system.time(dat[dat==Inf] <- NA)
user system elapsed
0.095 0.010 0.108
> system.time(invisible(lapply(names(DT),function(.name) set(DT, which(is.infinite(DT[[.name]])), j = .name,value =NA))))
user system elapsed
0.065 0.002 0.067
> system.time(rationalize(dat))
user system elapsed
0.058 0.014 0.072
>
Seems like data.table is faster than hablar. But has longer syntax.