Select values from different columns based on a variable containing column names [duplicate]
Solution 1:
An excuse to use the obscure .BY:
DT[, newval := .SD[[.BY[[1]]]], by=new]
   col1 col2 col3  new newval
1:    1    4   55 col1      1
2:    2    3   44 col2      3
3:    3   34   35 col2     34
4:    4   44   87 col3     87
How it works. This splits the data into groups based on the strings in new. The string for each group is available as .BY[[1]] (call it newname). We use this string to select the corresponding column of .SD via .SD[[newname]]. .SD stands for Subset of Data: within each group it holds that group's rows for every column except the grouping column.
Alternatives. get(.BY[[1]]) should work just as well in place of .SD[[.BY[[1]]]]. According to a benchmark run by @David, the two ways are equally fast.
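For anyone who wants to run this, here is a minimal reproducible sketch. The table DT is not defined in the answer, so it is rebuilt here from the printed output above (the column types are an assumption), and newval_get is a hypothetical column name used only to compare the two variants.

```r
library(data.table)

# Rebuild the example table from the printed output (types assumed numeric).
DT <- data.table(
  col1 = c(1, 2, 3, 4),
  col2 = c(4, 3, 34, 44),
  col3 = c(55, 44, 35, 87),
  new  = c("col1", "col2", "col2", "col3")
)

# Within each group, .BY[[1]] is the group's string from `new`,
# and .SD holds that group's rows for every column except `new`.
DT[, newval := .SD[[.BY[[1]]]], by = new]

# Same lookup via get(); newval_get is a hypothetical name for comparison.
DT[, newval_get := get(.BY[[1]]), by = new]

print(DT)
```

Both columns should hold the same values, since each group simply picks the column named by its own key.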
Solution 2:
We can match the 'new' column against the column names of the dataset to get the column index, cbind it with the row index (1:nrow(df1)), and extract the corresponding elements of the dataset using that row/column index matrix. The result can be assigned to a new column. The 'new' column itself is dropped first (df1[-4]) so that only the numeric columns are indexed.
df1$matched_value <- df1[-4][cbind(1:nrow(df1),match(df1$new, colnames(df1) ))]
df1
#  col1 col2 col3  new matched_value
#1    1    4   55 col1             1
#2    2    3   44 col2             3
#3    3   34   35 col2            34
#4    4   44   87 col3            87
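To make the indexing step easier to follow, here is a sketch that spells out the intermediate objects. df1 is rebuilt from the output above (an assumption about the original data), and value_cols, col_idx, idx and matched are illustrative names. It is a slight variant of the one-liner: it matches against the value columns by name instead of relying on 'new' being the fourth column.

```r
# Rebuild the example data from the printed output (values assumed).
df1 <- data.frame(
  col1 = c(1, 2, 3, 4),
  col2 = c(4, 3, 34, 44),
  col3 = c(55, 44, 35, 87),
  new  = c("col1", "col2", "col2", "col3"),
  stringsAsFactors = FALSE
)

# Columns that hold values, i.e. everything except the lookup column.
value_cols <- setdiff(names(df1), "new")

# Column position for each row, then one (row, column) pair per row.
col_idx <- match(df1$new, value_cols)
idx <- cbind(seq_len(nrow(df1)), col_idx)

# Indexing a matrix with a two-column matrix picks one cell per pair.
matched <- as.matrix(df1[value_cols])[idx]
print(matched)
```

Matching on value_cols rather than on all column names keeps the lookup correct even if 'new' is not the last column.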
NOTE: If the OP has a data.table, one option is to convert it to a data.frame, or to use with=FALSE while subsetting.
setDF(df1) #to convert to 'data.frame'.
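A rough sketch of both routes from the note, assuming the same example data as above. The names DT, value_cols, idx, via_with, df_copy and via_df are illustrative, and converting the with=FALSE subset to a matrix before indexing is my assumption about how that route would be completed, not something the answer spells out.

```r
library(data.table)

# Example data rebuilt from the printed output (values assumed).
DT <- data.table(
  col1 = c(1, 2, 3, 4),
  col2 = c(4, 3, 34, 44),
  col3 = c(55, 44, 35, 87),
  new  = c("col1", "col2", "col2", "col3")
)

value_cols <- c("col1", "col2", "col3")
idx <- cbind(seq_len(nrow(DT)), match(DT$new, value_cols))

# Route 1: select columns by name with with = FALSE, then index the matrix form.
via_with <- as.matrix(DT[, value_cols, with = FALSE])[idx]

# Route 2: setDF() converts by reference, so work on a copy
# to keep the original DT a data.table.
df_copy <- setDF(copy(DT))
via_df <- df_copy[-4][idx]
```

Calling setDF() directly on the original object changes its class in place, so copy() is only needed if the data.table should be kept.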
Benchmarks
set.seed(45)
df2 <- data.frame(col1 = sample(1:9, 20e6, replace=TRUE),
                  col2 = sample(1:20, 20e6, replace=TRUE),
                  col3 = sample(1:40, 20e6, replace=TRUE),
                  col4 = sample(1:30, 20e6, replace=TRUE),
                  new = sample(paste0('col', 1:4), 20e6, replace=TRUE),
                  stringsAsFactors=FALSE)
system.time(df2$matched_value <- df2[-5][cbind(1:nrow(df2),match(df2$new, colnames(df2) ))])
#   user  system elapsed
#   2.54    0.37    2.92