Select / assign to data.table when variable names are stored in a character vector
How do you refer to variables in a data.table
if the variable names are stored in a character vector? For instance, this works for a data.frame
:
df <- data.frame(col1 = 1:3)
colname <- "col1"
df[colname] <- 4:6
df
# col1
# 1 4
# 2 5
# 3 6
How can I perform this same operation for a data.table, either with or without :=
notation? The obvious thing of dt[ , list(colname)]
doesn't work (nor did I expect it to).
Solution 1:
Two ways to programmatically select variable(s):
-
with = FALSE
:DT = data.table(col1 = 1:3) colname = "col1" DT[, colname, with = FALSE] # col1 # 1: 1 # 2: 2 # 3: 3
-
'dot dot' (
..
) prefix:DT[, ..colname] # col1 # 1: 1 # 2: 2 # 3: 3
For further description of the 'dot dot' (..
) notation, see New Features in 1.10.2 (it is currently not described in help text).
To assign to variable(s), wrap the LHS of :=
in parentheses:
DT[, (colname) := 4:6]
# col1
# 1: 4
# 2: 5
# 3: 6
The latter is known as a column plonk, because you replace the whole column vector by reference. If a subset i
was present, it would subassign by reference. The parens around (colname)
is a shorthand introduced in version v1.9.4 on CRAN Oct 2014. Here is the news item:
Using
with = FALSE
with:=
is now deprecated in all cases, given that wrapping the LHS of:=
with parentheses has been preferred for some time.
colVar = "col1"
DT[, (colVar) := 1] # please change to this
DT[, c("col1", "col2") := 1] # no change
DT[, 2:4 := 1] # no change
DT[, c("col1","col2") := list(sum(a), mean(b))] # no change
DT[, `:=`(...), by = ...] # no change
See also Details section in ?`:=`
:
DT[i, (colnamevector) := value]
# [...] The parens are enough to stop the LHS being a symbol
And to answer further question in comment, here's one way (as usual there are many ways) :
DT[, colname := cumsum(get(colname)), with = FALSE]
# col1
# 1: 4
# 2: 9
# 3: 15
or, you might find it easier to read, write and debug just to eval
a paste
, similar to constructing a dynamic SQL statement to send to a server :
expr = paste0("DT[,",colname,":=cumsum(",colname,")]")
expr
# [1] "DT[,col1:=cumsum(col1)]"
eval(parse(text=expr))
# col1
# 1: 4
# 2: 13
# 3: 28
If you do that a lot, you can define a helper function EVAL
:
EVAL = function(...)eval(parse(text=paste0(...)),envir=parent.frame(2))
EVAL("DT[,",colname,":=cumsum(",colname,")]")
# col1
# 1: 4
# 2: 17
# 3: 45
Now that data.table
1.8.2 automatically optimizes j
for efficiency, it may be preferable to use the eval
method. The get()
in j
prevents some optimizations, for example.
Or, there is set()
. A low overhead, functional form of :=
, which would be fine here. See ?set
.
set(DT, j = colname, value = cumsum(DT[[colname]]))
DT
# col1
# 1: 4
# 2: 21
# 3: 66
Solution 2:
*This is not an answer really, but I don't have enough street cred to post comments :/
Anyway, for anyone who might be looking to actually create a new column in a data table with a name stored in a variable, I've got the following to work. I have no clue as to it's performance. Any suggestions for improvement? Is it safe to assume a nameless new column will always be given the name V1?
colname <- as.name("users")
# Google Analytics query is run with chosen metric and resulting data is assigned to DT
DT2 <- DT[, sum(eval(colname, .SD)), by = country]
setnames(DT2, "V1", as.character(colname))
Notice I can reference it just fine in the sum() but can't seem to get it to assign in the same step. BTW, the reason I need to do this is colname will be based on user input in a Shiny app.
Solution 3:
Retrieve multiple columns from data.table via variable or function:
library(data.table)
x <- data.table(this=1:2,that=1:2,whatever=1:2)
# === explicit call
x[, .(that, whatever)]
x[, c('that', 'whatever')]
# === indirect via variable
# ... direct assignment
mycols <- c('that','whatever')
# ... same as result of a function call
mycols <- grep('a', colnames(x), value=TRUE)
x[, ..mycols]
x[, .SD, .SDcols=mycols]
# === direct 1-liner usage
x[, .SD, .SDcols=c('that','whatever')]
x[, .SD, .SDcols=grep('a', colnames(x), value=TRUE)]
which all yield
that whatever
1: 1 1
2: 2 2
I find the .SDcols
way the most elegant.
Solution 4:
For multiple columns and a function applied on column values.
When updating the values from a function, the RHS must be a list object, so using a loop on .SD
with lapply
will do the trick.
The example below converts integer columns to numeric columns
a1 <- data.table(a=1:5, b=6:10, c1=letters[1:5])
sapply(a1, class) # show classes of columns
# a b c1
# "integer" "integer" "character"
# column name character vector
nm <- c("a", "b")
# Convert columns a and b to numeric type
a1[, j = (nm) := lapply(.SD, as.numeric ), .SDcols = nm ]
sapply(a1, class)
# a b c1
# "numeric" "numeric" "character"