data.table equivalent of tidyr::complete()
tidyr::complete()
adds rows to a data.frame
for combinations of column values that are missing from the data. Example:
library(dplyr)
library(tidyr)
df <- data.frame(person = c(1,2,2),
observation_id = c(1,1,2),
value = c(1,1,1))
df %>%
tidyr::complete(person,
observation_id,
fill = list(value=0))
yields
# A tibble: 4 × 3
person observation_id value
<dbl> <dbl> <dbl>
1 1 1 1
2 1 2 0
3 2 1 1
4 2 2 1
where the value
of the combination person == 1
and observation_id == 2
that is missing in df
has been filled in with a value of 0.
What would be the equivalent of this in data.table
?
I reckon that the philosophy of data.table entails fewer specially-named functions for tasks than you'll find in the tidyverse, so some extra coding is required, like:
res = setDT(df)[
CJ(person = person, observation_id = observation_id, unique=TRUE),
on=.(person, observation_id)
]
After this, you still have to manually handle the filling of values for missing levels. We can use setnafill
to handle this efficiently & by-reference in recent versions of data.table
:
setnafill(res, fill = 0, cols = 'value')
See @Jealie's answer regarding a feature that will sidestep this.
Certainly, it's crazy that the column names have to be entered three times here. But on the other hand, one can write a wrapper:
completeDT <- function(DT, cols, defs = NULL){
mDT = do.call(CJ, c(DT[, ..cols], list(unique=TRUE)))
res = DT[mDT, on=names(mDT)]
if (length(defs))
res[, names(defs) := Map(replace, .SD, lapply(.SD, is.na), defs), .SDcols=names(defs)]
res[]
}
completeDT(setDT(df), cols = c("person", "observation_id"), defs = c(value = 0))
person observation_id value
1: 1 1 1
2: 1 2 0
3: 2 1 1
4: 2 2 1
As a quick way of avoiding typing the names three times for the first step, here's @thelatemail's idea:
vars <- c("person","observation_id")
df[do.call(CJ, c(mget(vars), unique=TRUE)), on=vars]
# or with magrittr...
c("person","observation_id") %>% df[do.call(CJ, c(mget(.), unique=TRUE)), on=.]
Update: now you don't need to enter names twice in CJ thanks to @MichaelChirico & @MattDowle for the improvement.
There might be a better answer out there, but this works:
dt[CJ(person=unique(dt$person),
observation_id=unique(dt$observation_id)),
on=c('person','observation_id')]
Which gives:
person observation_id value
1: 1 1 1
2: 2 1 1
3: 1 2 NA
4: 2 2 1
Now, if you would like to be able to fill with any value (and not NA
), I would suggest to wait for the corresponding feature to be finished or contribute to it :)
It is worth noting that the completeDT
function above doesn't carry many of the features that tidyr::complete
does. In particular, empty factor levels are dropped - unlike tidyr::complete
which keeps them. If you do want to keep empty factor the function can be edited as below. The make_vals
function below could be made more sophisticated to handle other variable classes eg. full sequence for integers.
library(magrittr)
library(data.table)
dat <- data.frame(
person = c(1,2,2),
observation_id = factor(c(1,1,2), 1:3),
value = c(1,1,1))
dat %>%
tidyr::complete(
person, observation_id, fill = list(value=0))
#> # A tibble: 6 x 3
#> person observation_id value
#> <dbl> <fct> <dbl>
#> 1 1 1 1
#> 2 1 2 0
#> 3 1 3 0
#> 4 2 1 1
#> 5 2 2 1
#> 6 2 3 0
completeDT <- function(DT, cols, defs = NULL){
make_vals <- function(col) {
if(is.factor(col)) factor(levels(col))
else unique(col)
}
mDT = do.call(CJ, c(lapply(DT[, ..cols], make_vals), list(unique=TRUE)))
res = DT[mDT, on=names(mDT)]
if (length(defs))
res[, names(defs) := Map(replace, .SD, lapply(.SD, is.na), defs), .SDcols=names(defs)]
res[]
}
completeDT(DT = setDT(dat), cols = c("person", "observation_id"), defs = c(value = 0))
#> person observation_id value
#> 1: 1 1 1
#> 2: 1 2 0
#> 3: 1 3 0
#> 4: 2 1 1
#> 5: 2 2 1
#> 6: 2 3 0
Created on 2021-03-08 by the reprex package (v0.3.0)