How can one work fully generically in data.table in R with column names in variables
Solution 1:
Problem you are describing is not strictly related to data.table
.
Complex queries cannot be easily translated to code that machine can parse, thus we are not able to escape complexity in writing a query for complex operations.
You can try to imagine how to programmatically construct a query for the following data.table
query using dplyr
or SQL:
DT[, c(f1(v1, v2, opt=TRUE),
f2(v3, v4, v5, opt1=FALSE, opt2=TRUE),
lapply(.SD, f3, opt1=TRUE, opt2=FALSE))
, by=.(id1, id2)]
Assuming that all columns (id1
, id2
, v1
...v5
) or even options (opt
, opt1
, opt2
) should be passed as variables.
Because of complexity in expression of queries I don't think you could easily accomplish requirement stated in your question:
is simpler, more elegant, shorter, or easier to design or implement or understand than the one above or others that require frequent
quote
-ing andeval
-ing.
Although, comparing to other programming languages, base R provides very useful tools to deal with such problems.
You already found suggestions to use get
, mget
, DT[[col_name]]
, parse
, quote
, eval
.
- As you mentioned
DT[[col_name]]
might not play well withdata.table
optimizations, thus is not that useful here. -
parse
is probably the easiest way to construct complex queries as you can just operate on strings, but it doesn't provide basic language syntax validation. So you can ended up trying to parse a string that R parser does not accept. Additionally there is a security concern as presented in 2655#issuecomment-376781159. -
get
/mget
are the ones most commonly suggested to deal with such problems.get
andmget
are internally catch by[.data.table
and translated to expected columns. So you are assuming your arbitrary complex query will be able to be decomposed by[.data.table
and expected columns properly inputted. - Since you asked this question few years back, the new feature - dot-dot prefix - is being rolled out in recently. You prefix variable name using dot-dot to refer to a variable outside of the scope of current data.table. Similarly as you refer parent directory in file system. Internals behind dot-dot will be quite similar to
get
, variables having prefix will be de-referenced inside of[.data.table
. . In future releases dot-dot prefix may allow calls like:
col1="a"; col2="b"; col3="g"; col4="x"; col5="y"
DT[..col4==..col5, .(s1=sum(..col1), s2=sum(..col2)), by=..col3]
-
Personally I prefer
quote
andeval
instead.quote
andeval
is interpreted almost as written by hand from scratch. This method does not rely ondata.table
abilities to manage references to columns. We can expect all optimizations to work the same way as if you would write those queries by hand. I found it also easier to debug as at any point you can just print quoted expression to see what is actually passed todata.table
query. Additionally there is a less space for bugs to occur. Constructing complex queries using R language object is sometimes tricky, it is easy to wrap the procedure into function so it can be applied in different use cases and easily re-used. Important to note that this method is independent fromdata.table
. It uses R language constructs. You can find more information about that in official R Language Definition in Computing on the language chapter. -
What else?
- I submitted proposal of a new concept called macro in #1579. In short it is a wrapper on
DT[eval(qi), eval(qj), eval(qby)]
so you still have to operate on R language objects. You are welcome to put your comment there. - Recently I proposed another approach for metaprogramming interface in PR#4304. In short it plugs base R
substitute
functionality into[.data.table
using new argumentenv
.
- I submitted proposal of a new concept called macro in #1579. In short it is a wrapper on
Going to the example. Below I will show two ways to solve it. First one will use base R metaprogramming, second one will use metaprogramming for data.table proposed in PR#4304 (see above).
- Base R computing on the language
I will wrap all logic into do_vars
function. Calling do_vars(donot=TRUE)
will print expressions to be computed on data.table
instead of eval
them. Below code should be run just after the OP code.
expected = copy(new.table)
new.table = the.table[, list(asofdate=seq(from=ymd((year)*10^4+101), length.out=12, by="1 month")), by=year]
do_vars = function(x, y, vars, donot=FALSE) {
name.suffix = function(x, suffix) as.name(paste(x, suffix, sep="."))
do_var = function(var, x, y) {
substitute({
x[, .anntot := y[, rep(.var, each=12)]]
x[, .monthly := y[, rep(.var/12, each=12)]]
x[, .rolling := rollapply(.monthly, mean, width=12, fill=c(head(.monthly,1), tail(.monthly,1)))]
x[, .scaled := .anntot/sum(.rolling)*.rolling, by=year]
}, list(
.var=as.name(var),
.anntot=name.suffix(var, "annual.total"),
.monthly=name.suffix(var, "monthly"),
.rolling=name.suffix(var, "rolling"),
.scaled=name.suffix(var, "scaled")
))
}
ql = lapply(setNames(nm=vars), do_var, x, y)
if (donot) return(ql)
lapply(ql, eval.parent)
invisible(x)
}
do_vars(new.table, the.table, c("var1","var2","var3"))
all.equal(expected, new.table)
#[1] TRUE
we can preview queries
do_vars(new.table, the.table, c("var1","var2","var3"), donot=TRUE)
#$var1
#{
# x[, `:=`(var1.annual.total, y[, rep(var1, each = 12)])]
# x[, `:=`(var1.monthly, y[, rep(var1/12, each = 12)])]
# x[, `:=`(var1.rolling, rollapply(var1.monthly, mean, width = 12,
# fill = c(head(var1.monthly, 1), tail(var1.monthly, 1))))]
# x[, `:=`(var1.scaled, var1.annual.total/sum(var1.rolling) *
# var1.rolling), by = year]
#}
#
#$var2
#{
# x[, `:=`(var2.annual.total, y[, rep(var2, each = 12)])]
# x[, `:=`(var2.monthly, y[, rep(var2/12, each = 12)])]
# x[, `:=`(var2.rolling, rollapply(var2.monthly, mean, width = 12,
# fill = c(head(var2.monthly, 1), tail(var2.monthly, 1))))]
# x[, `:=`(var2.scaled, var2.annual.total/sum(var2.rolling) *
# var2.rolling), by = year]
#}
#
#$var3
#{
# x[, `:=`(var3.annual.total, y[, rep(var3, each = 12)])]
# x[, `:=`(var3.monthly, y[, rep(var3/12, each = 12)])]
# x[, `:=`(var3.rolling, rollapply(var3.monthly, mean, width = 12,
# fill = c(head(var3.monthly, 1), tail(var3.monthly, 1))))]
# x[, `:=`(var3.scaled, var3.annual.total/sum(var3.rolling) *
# var3.rolling), by = year]
#}
#
- Proposed data.table metaprogramming
expected = copy(new.table)
new.table = the.table[, list(asofdate=seq(from=ymd((year)*10^4+101), length.out=12, by="1 month")), by=year]
name.suffix = function(x, suffix) as.name(paste(x, suffix, sep="."))
do_var2 = function(var, x, y) {
x[, .anntot := y[, rep(.var, each=12)],
env = list(
.anntot = name.suffix(var, "annual.total"),
.var = var
)]
x[, .monthly := y[, rep(.var/12, each=12)],
env = list(
.monthly = name.suffix(var, "monthly"),
.var = var
)]
x[, .rolling := rollapply(.monthly, mean, width=12, fill=c(head(.monthly,1), tail(.monthly,1))),
env = list(
.rolling = name.suffix(var, "rolling"),
.monthly = name.suffix(var, "monthly")
)]
x[, .scaled := .anntot/sum(.rolling)*.rolling, by=year,
env = list(
.scaled = name.suffix(var, "scaled"),
.anntot = name.suffix(var, "annual.total"),
.rolling = name.suffix(var, "rolling")
)]
TRUE
}
sapply(setNames(nm=var.names), do_var2, new.table, the.table)
#var1 var2 var3
#TRUE TRUE TRUE
all.equal(expected, new.table)
#[1] TRUE
Data and updated OP code
library(data.table)
library(lubridate)
library(zoo)
the.table <- data.table(year=1991:1996,var1=floor(runif(6,400,1400)))
the.table[,`:=`(var2=var1/floor(runif(6,2,5)),
var3=var1/floor(runif(6,2,5)))]
# Replicate data across months
new.table <- the.table[, list(asofdate=seq(from=ymd((year)*10^4+101),
length.out=12,
by="1 month")),by=year]
# Do a complicated procedure to each variable in some group.
var.names <- c("var1","var2","var3")
for(varname in var.names) {
#As suggested in an answer to Link 3 above
#Convert the column name to a 'quote' object
quote.convert <- function(x) eval(parse(text=paste0('quote(',x,')')))
#Do this for every column name I'll need
varname <- quote.convert(varname)
anntot <- quote.convert(paste0(varname,".annual.total"))
monthly <- quote.convert(paste0(varname,".monthly"))
rolling <- quote.convert(paste0(varname,".rolling"))
scaled <- quote.convert(paste0(varname,".scaled"))
#Perform the relevant tasks, using eval()
#around every variable columnname I may want
new.table[,paste0(varname,".annual.total"):=
the.table[,rep(eval(varname),each=12)]]
new.table[,paste0(varname,".monthly"):=
the.table[,rep(eval(varname)/12,each=12)]]
new.table[,paste0(varname,".rolling"):=
rollapply(eval(monthly),mean,width=12,
fill=c(head(eval(monthly),1),
tail(eval(monthly),1)))]
new.table[,paste0(varname,".scaled"):=
eval(anntot)/sum(eval(rolling))*eval(rolling),
by=year]
}
Solution 2:
Thanks for the question. Your original approach goes a long way towards solving most of the issues.
Here I've tweaked the quoting function slightly, and changed the approach to parse and evaluate the entire RHS expression as a string instead of the individual variables.
The reasoning being:
- You probably don't want to be repeating yourself by declaring every variable you need to use at the start of the loop.
- Strings will scale better since they can be generated programmatically. I've added an example below that calculates row-wise percentages to illustrate this.
library(data.table)
library(lubridate)
library(zoo)
set.seed(1)
the.table <- data.table(year=1991:1996,var1=floor(runif(6,400,1400)))
the.table[,`:=`(var2=var1/floor(runif(6,2,5)),
var3=var1/floor(runif(6,2,5)))]
# Replicate data across months
new.table <- the.table[, list(asofdate=seq(from=ymd((year)*10^4+101),
length.out=12,
by="1 month")),by=year]
# function to paste, parse & evaluate arguments
evalp <- function(..., envir=parent.frame()) {eval(parse(text=paste0(...)), envir=envir)}
# Do a complicated procedure to each variable in some group.
var.names <- c("var1","var2","var3")
for(varname in var.names) {
# 1. For LHS, use paste0 to generate new column name as string (from @eddi's comment)
# 2. For RHS, use evalp
new.table[, paste0(varname, '.annual.total') := evalp(
'the.table[,rep(', varname, ',each=12)]'
)]
new.table[, paste0(varname, '.monthly') := evalp(
'the.table[,rep(', varname, '/12,each=12)]'
)]
# Need to add envir=.SD when working within the table
new.table[, paste0(varname, '.rolling') := evalp(
'rollapply(',varname, '.monthly,mean,width=12,
fill=c(head(', varname, '.monthly,1), tail(', varname, '.monthly,1)))'
, envir=.SD
)]
new.table[,paste0(varname, '.scaled'):= evalp(
varname, '.annual.total / sum(', varname, '.rolling) * ', varname, '.rolling'
, envir=.SD
)
,by=year
]
# Since we're working with strings, more freedom
# to work programmatically
new.table[, paste0(varname, '.row.percent') := evalp(
'the.table[,rep(', varname, '/ (', paste(var.names, collapse='+'), '), each=12)]'
)]
}
Solution 3:
I tried to do this in data.table thinking "this isn't so bad"... but after an embarrassing length of time, I gave up. Matt says something like 'do in pieces then join', but I couldn't figure out elegant ways to do these pieces, especially because the last one depends on previous steps.
I have to say, this is a pretty brilliantly constructed question, and I too encounter similar issues frequently. I love data.table, but I still struggle sometimes. I don't know if I'm struggling with data.table or the complexity of the problem.
Here is the incomplete approach I've taken.
Realistically I can imagine that in a normal process you would have more intermediate variables stored that would be useful for calculating these values.
library(data.table)
library(zoo)
## Example yearly data
set.seed(27)
DT <- data.table(year=1991:1996,
var1=floor(runif(6,400,1400)))
DT[ , var2 := var1 / floor(runif(6,2,5))]
DT[ , var3 := var1 / floor(runif(6,2,5))]
setkeyv(DT,colnames(DT)[1])
DT
## Convenience function
nonkey <- function(dt){colnames(dt)[!colnames(dt)%in%key(dt)]}
## Annual data expressed monthly
NewDT <- DT[, j=list(asofdate=as.IDate(paste(year, 1:12, 1, sep="-"))), by=year]
setkeyv(NewDT, colnames(NewDT)[1:2])
## Create annual data
NewDT_Annual <- NewDT[DT]
setnames(NewDT_Annual,
nonkey(NewDT_Annual),
paste0(nonkey(NewDT_Annual), ".annual.total"))
## Compute monthly data
NewDT_Monthly <- NewDT[DT[ , .SD / 12, keyby=list(year)]]
setnames(NewDT_Monthly,
nonkey(NewDT_Monthly),
paste0(nonkey(NewDT_Monthly), ".monthly"))
## Compute rolling stats
NewDT_roll <- NewDT_Monthly[j = lapply(.SD, rollapply, mean, width=12,
fill=c(.SD[1],tail(.SD, 1))),
.SDcols=nonkey(NewDT_Monthly)]
NewDT_roll <- cbind(NewDT_Monthly[,1:2,with=F], NewDT_roll)
setkeyv(NewDT_roll, colnames(NewDT_roll)[1:2])
setnames(NewDT_roll,
nonkey(NewDT_roll),
gsub(".monthly$",".rolling",nonkey(NewDT_roll)))
## Compute normalized values
## Compute "adjustment" table which is
## total of each variable, by year for rolling
## divided by
## original annual totals
## merge "adjustment values" in with monthly data, and then
## make a modified data.table which is each varaible * annual adjustment factor
## Merge everything
NewDT_Combined <- NewDT_Annual[NewDT_roll][NewDT_Monthly]