Fastest way to add rows for missing time steps?
I have a column in my datasets where time periods (Time) are integers ranging from a to b. Sometimes there are missing time periods for a given group. I'd like to fill in those rows with NA. Below is example data for one of several thousand groups.
structure(list(Id = c(1, 1, 1, 1), Time = c(1, 2, 4, 5), Value = c(0.568780482159894,
-0.7207749516298, 1.24258192959273, 0.682123081696789)), .Names = c("Id",
"Time", "Value"), row.names = c(NA, 4L), class = "data.frame")
Id Time Value
1 1 1 0.5687805
2 1 2 -0.7207750
3 1 4 1.2425819
4 1 5 0.6821231
As you can see, Time 3 is missing. Often one or more could be missing. I can solve this on my own but am afraid I wouldn't be doing this the most efficient way. My approach would be to create a function that:
1. Generate a sequence of time periods from min(Time) to max(Time).
2. Then do a setdiff to grab the missing Time values.
3. Convert that vector to a data.frame.
4. Pull unique identifier variables (Id and others not listed above), and add them to this data.frame.
5. Merge the two.
6. Return from function.
So the entire process would then get executed as below:
library(plyr)
# Split the data into individual data.frames by Id.
temp_list <- dlply(original_data, .(Id))
# pad each data.frame
tlist2 <- llply(temp_list, my_pad_function)
# collapse the list back to a data.frame
filled_in_data <- ldply(tlist2)
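For reference, a minimal sketch of what my_pad_function could look like if it follows the steps listed above. The function body is an assumption written for illustration (only its name appears in the question), and base R split/lapply/rbind stands in for the plyr calls so the example is self-contained.

```r
# Hypothetical implementation of the padding steps; not benchmarked.
my_pad_function <- function(df) {
  # Step 1: full sequence of time periods for this group
  all_times <- seq(min(df$Time), max(df$Time))
  # Step 2: time periods that are absent from the data
  missing_times <- setdiff(all_times, df$Time)
  if (length(missing_times) == 0) return(df)
  # Steps 3-4: data.frame of missing periods plus the group identifier
  pad <- data.frame(Id = df$Id[1], Time = missing_times)
  # Step 5: full outer merge; Value becomes NA for the padded rows
  out <- merge(df, pad, by = c("Id", "Time"), all = TRUE)
  # Step 6: return rows ordered by time
  out[order(out$Time), ]
}

original_data <- data.frame(Id = c(1, 1, 1, 1),
                            Time = c(1, 2, 4, 5),
                            Value = c(0.57, -0.72, 1.24, 0.68))

# Split by Id, pad each piece, and bind the pieces back together
filled_in_data <- do.call(rbind,
                          lapply(split(original_data, original_data$Id),
                                 my_pad_function))
```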
Better way to achieve this?
Solution 1:
Following up on comments with Ben Barnes and starting with his mydf3:
DT = as.data.table(mydf3)
setkey(DT,Id,Time)
DT[CJ(unique(Id),seq(min(Time),max(Time)))]
Id Time Value Id2
[1,] 1 1 -0.262482283 2
[2,] 1 2 -1.423935165 2
[3,] 1 3 0.500523295 1
[4,] 1 4 -1.912687398 1
[5,] 1 5 -1.459766444 2
[6,] 1 6 -0.691736451 1
[7,] 1 7 NA NA
[8,] 1 8 0.001041489 2
[9,] 1 9 0.495820559 2
[10,] 1 10 -0.673167744 1
First 10 rows of 12800 printed.
setkey(DT,Id,Id2,Time)
DT[CJ(unique(Id),unique(Id2),seq(min(Time),max(Time)))]
Id Id2 Time Value
[1,] 1 1 1 NA
[2,] 1 1 2 NA
[3,] 1 1 3 0.5005233
[4,] 1 1 4 -1.9126874
[5,] 1 1 5 NA
[6,] 1 1 6 -0.6917365
[7,] 1 1 7 NA
[8,] 1 1 8 NA
[9,] 1 1 9 NA
[10,] 1 1 10 -0.6731677
First 10 rows of 25600 printed.
CJ stands for Cross Join, see ?CJ. The padding with NAs happens because nomatch by default is NA. Set nomatch to 0 instead to remove the no matches. If instead of padding with NAs the prevailing row is required, just add roll=TRUE. This can be more efficient than padding with NAs and then filling NAs afterwards. See the description of roll in ?data.table.
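To make the three behaviours easier to compare, here is a small sketch on the four-row example from the question rather than on mydf3. The object names (grid, padded, matched, rolled) are made up for illustration, and nomatch = NULL is used as the spelling of "drop non-matches" that recent data.table versions accept.

```r
library(data.table)

DT <- data.table(Id = c(1, 1, 1, 1),
                 Time = c(1, 2, 4, 5),
                 Value = c(0.57, -0.72, 1.24, 0.68))
setkey(DT, Id, Time)

# All Id x Time combinations over the observed time range
grid <- CJ(unique(DT$Id), seq(min(DT$Time), max(DT$Time)))

padded  <- DT[grid]                   # default: unmatched rows padded with NA
matched <- DT[grid, nomatch = NULL]   # unmatched rows dropped
rolled  <- DT[grid, roll = TRUE]      # unmatched rows take the previous Value
```

With this data, padded has a row for Time 3 with Value NA, matched omits that row, and rolled carries the Value from Time 2 forward.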
setkey(DT,Id,Time)
DT[CJ(unique(Id),seq(min(Time),max(Time))),roll=TRUE]
Id Time Value Id2
[1,] 1 1 -0.262482283 2
[2,] 1 2 -1.423935165 2
[3,] 1 3 0.500523295 1
[4,] 1 4 -1.912687398 1
[5,] 1 5 -1.459766444 2
[6,] 1 6 -0.691736451 1
[7,] 1 7 -0.691736451 1
[8,] 1 8 0.001041489 2
[9,] 1 9 0.495820559 2
[10,] 1 10 -0.673167744 1
First 10 rows of 12800 printed.
setkey(DT,Id,Id2,Time)
DT[CJ(unique(Id),unique(Id2),seq(min(Time),max(Time))),roll=TRUE]
Id Id2 Time Value
[1,] 1 1 1 NA
[2,] 1 1 2 NA
[3,] 1 1 3 0.5005233
[4,] 1 1 4 -1.9126874
[5,] 1 1 5 -1.9126874
[6,] 1 1 6 -0.6917365
[7,] 1 1 7 -0.6917365
[8,] 1 1 8 -0.6917365
[9,] 1 1 9 -0.6917365
[10,] 1 1 10 -0.6731677
First 10 rows of 25600 printed.
Instead of setting keys, you may use on. CJ also takes a unique argument. A small example with two 'Id':
d <- data.table(Id = rep(1:2, 4:3), Time = c(1, 2, 4, 5, 2, 3, 4), val = 1:7)
d[CJ(Id, Time = seq(min(Time), max(Time)), unique = TRUE), on = .(Id, Time)]
# Id Time val
# 1: 1 1 1
# 2: 1 2 2
# 3: 1 3 NA
# 4: 1 4 3
# 5: 1 5 4
# 6: 2 1 NA
# 7: 2 2 5
# 8: 2 3 6
# 9: 2 4 7
# 10: 2 5 NA
In this particular case, where one of the vectors in CJ was generated with seq, the result needs to be named explicitly in order to match the names specified in on. When using bare variables in CJ though (like 'Id' here), they are auto-named, like in data.table() (from data.table 1.12.2).
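A quick way to see the naming behaviour is to inspect the column names of the cross join directly. This is only a sketch: the vectors below are made up, and it assumes data.table 1.12.2 or later.

```r
library(data.table)

Id   <- c(1, 1, 2)
Time <- c(1, 3, 2)

# Bare variable 'Id' is auto-named; the seq() result is named explicitly
named   <- names(CJ(Id, Time = seq(min(Time), max(Time)), unique = TRUE))

# Without the explicit name, the second column does not get the name "Time"
unnamed <- names(CJ(Id, seq(min(Time), max(Time)), unique = TRUE))
```

Only the first form yields columns that on = .(Id, Time) can match by name.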
Solution 2:
You can use tidyr for this.
Use tidyr::complete to fill in rows for Time, and by default the values are filled in with NA.
Create Data
I extended the sample data to show that it works for multiple Ids and even when within an Id the full range of Time is not present.
library(dplyr)
library(tidyr)
df <- tibble(
  Id = c(1, 1, 1, 1, 2, 2, 2),
  Time = c(1, 2, 4, 5, 2, 3, 5),
  Value = c(0.56, -0.72, 1.24, 0.68, 1.46, 0.74, 0.99)
)
df
#> # A tibble: 7 x 3
#> Id Time Value
#> <dbl> <dbl> <dbl>
#> 1 1 1 0.56
#> 2 1 2 -0.72
#> 3 1 4 1.24
#> 4 1 5 0.68
#> 5 2 2 1.46
#> 6 2 3 0.74
#> 7 2 5 0.99
Fill in the missing rows
df %>% complete(nesting(Id), Time = seq(min(Time), max(Time), 1L))
#> # A tibble: 10 x 3
#> Id Time Value
#> <dbl> <dbl> <dbl>
#> 1 1 1 0.56
#> 2 1 2 -0.72
#> 3 1 3 NA
#> 4 1 4 1.24
#> 5 1 5 0.68
#> 6 2 1 NA
#> 7 2 2 1.46
#> 8 2 3 0.74
#> 9 2 4 NA
#> 10 2 5 0.99
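Two variations that may be useful, sketched under the assumption of a reasonably recent tidyr. The fill argument of complete replaces the default NA with a chosen value, and grouping by Id first restricts each group to its own min-to-max range, which is closer to what the question describes. The names filled_zero and per_group are illustrative only.

```r
library(dplyr)
library(tidyr)

df <- tibble(
  Id = c(1, 1, 1, 1, 2, 2, 2),
  Time = c(1, 2, 4, 5, 2, 3, 5),
  Value = c(0.56, -0.72, 1.24, 0.68, 1.46, 0.74, 0.99)
)

# Same completion as above, but padded rows get Value = 0 instead of NA
filled_zero <- df %>%
  complete(nesting(Id), Time = seq(min(Time), max(Time), 1L),
           fill = list(Value = 0))

# Complete each Id only over its own observed time range
per_group <- df %>%
  group_by(Id) %>%
  complete(Time = seq(min(Time), max(Time), 1L)) %>%
  ungroup()
```

In the per-group version Id 2 starts at Time 2, so no row is created for Id 2 at Time 1.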
Solution 3:
Please see Matthew Dowle's answer (by now, hopefully above).
Here's something that uses the data.table package, and it may help when there is more than one ID variable. It may also be faster than merge, depending on how you want your results. I'd be interested in benchmarking and/or suggested improvements.
First, create some more demanding data with two ID variables
library(data.table)
set.seed(1)
mydf3<-data.frame(Id=sample(1:100,10000,replace=TRUE),
                  Value=rnorm(10000))
mydf3<-mydf3[order(mydf3$Id),]
mydf3$Time<-unlist(by(mydf3,mydf3$Id,
                      function(x)sample(1:(nrow(x)+3),nrow(x)),simplify=TRUE))
mydf3$Id2<-sample(1:2,nrow(mydf3),replace=TRUE)
Create a function (This has been EDITED - see history)
padFun<-function(data,idvars,timevar){
  # Coerce ID variables to character
  data[,idvars]<-lapply(data[,idvars,drop=FALSE],as.character)
  # Create global ID variable of all individual ID vars pasted together
  globalID<-Reduce(function(...)paste(...,sep="SOMETHINGWACKY"),
                   data[,idvars,drop=FALSE])
  # Create data.frame of all possible combinations of globalIDs and times
  allTimes<-expand.grid(globalID=unique(globalID),
                        allTime=min(data[,timevar]):max(data[,timevar]),
                        stringsAsFactors=FALSE)
  # Get the original ID variables back
  allTimes2<-data.frame(allTimes$allTime,do.call(rbind,
    strsplit(allTimes$globalID,"SOMETHINGWACKY")),stringsAsFactors=FALSE)
  # Convert combinations data.frame to data.table with idvars and timevar as key
  allTimesDT<-data.table(allTimes2)
  setnames(allTimesDT,1:ncol(allTimesDT),c(timevar,idvars))
  setkeyv(allTimesDT,c(idvars,timevar))
  # Convert data to data.table with same variables as key
  dataDT<-data.table(data,key=c(idvars,timevar))
  # Join the two data.tables to create padding
  res<-dataDT[allTimesDT]
  return(res)
}
Use the function
(padded2<-padFun(data=mydf3,idvars=c("Id"),timevar="Time"))
# Id Time Value Id2
# [1,] 1 1 -0.262482283 2
# [2,] 1 2 -1.423935165 2
# [3,] 1 3 0.500523295 1
# [4,] 1 4 -1.912687398 1
# [5,] 1 5 -1.459766444 2
# [6,] 1 6 -0.691736451 1
# [7,] 1 7 NA NA
# [8,] 1 8 0.001041489 2
# [9,] 1 9 0.495820559 2
# [10,] 1 10 -0.673167744 1
# First 10 rows of 12800 printed.
(padded<-padFun(data=mydf3,idvars=c("Id","Id2"),timevar="Time"))
# Id Id2 Time Value
# [1,] 1 1 1 NA
# [2,] 1 1 2 NA
# [3,] 1 1 3 0.5005233
# [4,] 1 1 4 -1.9126874
# [5,] 1 1 5 NA
# [6,] 1 1 6 -0.6917365
# [7,] 1 1 7 NA
# [8,] 1 1 8 NA
# [9,] 1 1 9 NA
# [10,] 1 1 10 -0.6731677
# First 10 rows of 25600 printed.
The edited function splits the globalID into its component parts in the combination data.frame, before merging with the original data. This should (I think) be better.
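As a possible simplification, the paste/strsplit round trip could be avoided by taking the observed ID combinations directly and repeating them once per time step. The following is only a sketch of that idea, not the function above: pad_observed is a hypothetical name, it keeps the ID columns in their original types rather than coercing to character, and it has not been benchmarked.

```r
library(data.table)

# Hypothetical alternative to padFun: pad every observed ID combination
# over the global time range without building a pasted global ID.
pad_observed <- function(data, idvars, timevar) {
  dataDT <- as.data.table(data)
  # Observed combinations of the ID variables
  ids <- unique(dataDT[, ..idvars])
  # Global range of the time variable
  times <- seq(min(dataDT[[timevar]]), max(dataDT[[timevar]]))
  # Repeat each ID combination once per time step
  grid <- ids[rep(seq_len(nrow(ids)), each = length(times))]
  grid[, (timevar) := rep(times, times = nrow(ids))]
  # Join on the ID and time columns; unmatched rows are padded with NA
  dataDT[grid, on = c(idvars, timevar)]
}

d <- data.frame(Id = c(1, 1, 1, 2, 2),
                Id2 = c(1, 1, 1, 2, 2),
                Time = c(1, 2, 4, 1, 4),
                Value = 1:5)
res <- pad_observed(d, idvars = c("Id", "Id2"), timevar = "Time")
```

Like padFun, this pads only ID combinations that occur in the data, whereas CJ(unique(Id), unique(Id2), ...) also creates combinations that never occur.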