Replacing NAs with latest non-NA value
You probably want to use the na.locf()
function from the zoo package to carry the last observation forward to replace your NA values.
Here is the beginning of its usage example from the help page:
library(zoo)
az <- zoo(1:6)
bz <- zoo(c(2,NA,1,4,5,2))
na.locf(bz)
1 2 3 4 5 6
2 2 1 4 5 2
na.locf(bz, fromLast = TRUE)
1 2 3 4 5 6
2 1 1 4 5 2
cz <- zoo(c(NA,9,3,2,3,2))
na.locf(cz)
2 3 4 5 6
9 3 2 3 2
Sorry for digging up an old question. I couldn't look up the function to do this job on the train, so I wrote one myself.
I was proud to find out that it's a tiny bit faster.
It's less flexible though.
But it plays nice with ave
, which is what I needed.
repeat.before = function(x) { # repeats the last non NA value. Keeps leading NA
ind = which(!is.na(x)) # get positions of nonmissing values
if(is.na(x[1])) # if it begins with a missing, add the
ind = c(1,ind) # first position to the indices
rep(x[ind], times = diff( # repeat the values at these indices
c(ind, length(x) + 1) )) # diffing the indices + length yields how often
} # they need to be repeated
x = c(NA,NA,'a',NA,NA,NA,NA,NA,NA,NA,NA,'b','c','d',NA,NA,NA,NA,NA,'e')
xx = rep(x, 1000000)
system.time({ yzoo = na.locf(xx,na.rm=F)})
## user system elapsed
## 2.754 0.667 3.406
system.time({ yrep = repeat.before(xx)})
## user system elapsed
## 0.597 0.199 0.793
Edit
As this became my most upvoted answer, I was reminded often that I don't use my own function, because I often need zoo's maxgap
argument. Because zoo has some weird problems in edge cases when I use dplyr + dates that I couldn't debug, I came back to this today to improve my old function.
I benchmarked my improved function and all the other entries here. For the basic set of features, tidyr::fill
is fastest while also not failing the edge cases. The Rcpp entry by @BrandonBertelsen is faster still, but it's inflexible regarding the input's type (he tested edge cases incorrectly due to a misunderstanding of all.equal
).
If you need maxgap
, my function below is faster than zoo (and doesn't have the weird problems with dates).
I put up the documentation of my tests.
new function
repeat_last = function(x, forward = TRUE, maxgap = Inf, na.rm = FALSE) {
if (!forward) x = rev(x) # reverse x twice if carrying backward
ind = which(!is.na(x)) # get positions of nonmissing values
if (is.na(x[1]) && !na.rm) # if it begins with NA
ind = c(1,ind) # add first pos
rep_times = diff( # diffing the indices + length yields how often
c(ind, length(x) + 1) ) # they need to be repeated
if (maxgap < Inf) {
exceed = rep_times - 1 > maxgap # exceeding maxgap
if (any(exceed)) { # any exceed?
ind = sort(c(ind[exceed] + 1, ind)) # add NA in gaps
rep_times = diff(c(ind, length(x) + 1) ) # diff again
}
}
x = rep(x[ind], times = rep_times) # repeat the values at these indices
if (!forward) x = rev(x) # second reversion
x
}
I've also put the function in my formr package (Github only).
a data.table
solution:
dt <- data.table(y = c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA))
dt[, y_forward_fill := y[1], .(cumsum(!is.na(y)))]
dt
y y_forward_fill
1: NA NA
2: 2 2
3: 2 2
4: NA 2
5: NA 2
6: 3 3
7: NA 3
8: 4 4
9: NA 4
10: NA 4
this approach could work with forward filling zeros as well:
dt <- data.table(y = c(0, 2, -2, 0, 0, 3, 0, -4, 0, 0))
dt[, y_forward_fill := y[1], .(cumsum(y != 0))]
dt
y y_forward_fill
1: 0 0
2: 2 2
3: -2 -2
4: 0 -2
5: 0 -2
6: 3 3
7: 0 3
8: -4 -4
9: 0 -4
10: 0 -4
this method becomes very useful on data at scale and where you would want to perform a forward fill by group(s), which is trivial with data.table
. just add the group(s) to the by
clause prior to the cumsum
logic.
dt <- data.table(group = sample(c('a', 'b'), 20, replace = TRUE), y = sample(c(1:4, rep(NA, 4)), 20 , replace = TRUE))
dt <- dt[order(group)]
dt[, y_forward_fill := y[1], .(group, cumsum(!is.na(y)))]
dt
group y y_forward_fill
1: a NA NA
2: a NA NA
3: a NA NA
4: a 2 2
5: a NA 2
6: a 1 1
7: a NA 1
8: a 3 3
9: a NA 3
10: a NA 3
11: a 4 4
12: a NA 4
13: a 1 1
14: a 4 4
15: a NA 4
16: a 3 3
17: b 4 4
18: b NA 4
19: b NA 4
20: b 2 2
You can use the data.table
function nafill
, available from data.table >= 1.12.3
.
library(data.table)
nafill(y, type = "locf")
# [1] NA 2 2 2 2 3 3 4 4 4
If your vector is a column in a data.table
, you can also update it by reference with setnafill
:
d <- data.table(x = 1:10, y)
setnafill(d, type = "locf", cols = "y")
d
# x y
# 1: 1 NA
# 2: 2 2
# 3: 3 2
# 4: 4 2
# 5: 5 2
# 6: 6 3
# 7: 7 3
# 8: 8 4
# 9: 9 4
# 10: 10 4
If you have NA
in several columns...
d <- data.table(x = c(1, NA, 2), y = c(2, 3, NA), z = c(4, NA, 5))
# x y z
# 1: 1 2 4
# 2: NA 3 NA
# 3: 2 NA 5
...you can fill them by reference in one go:
setnafill(d, type = "locf")
d
# x y z
# 1: 1 2 4
# 2: 1 3 4
# 3: 2 3 5
Note that:
Only double and integer data types are currently [
data.table 1.12.6
] supported.
The functionality will most likely soon be extended; see the open issue nafill, setnafill for character, factor and other types, where you also find a temporary workaround.