Row operations in data.table using `by = .I`
Note: section (3) of this answer updated in April 2019, due to many changes in data.table over time redering the original version obsolete. Also, use of the argument with=
removed from all instances of data.table, as it has since been deprecated.
1) Well, one reason not to use it, at least for the rowsums
example is performance, and creation of an unnecessary column. Compare to option f2 below, which is almost 4x faster and does not need the rowpos column (Note that the original question used rowSums
as the example function, to which this part of the answer responds. OP edited the question afterwards to use a different function, for which part 3 of this answer is more relevant`):
dt <- data.table(V0 =LETTERS[c(1,1,2,2,3)], V1=1:5, V2=3:7, V3=5:1)
f1 <- function(dt){
dt[, rowpos := .I]
dt[ , sdd := rowSums(.SD[, 2:4]), by = rowpos ] }
f2 <- function(dt) dt[, sdd := rowSums(.SD), .SDcols= 2:4]
library(microbenchmark)
microbenchmark(f1(dt),f2(dt))
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# f1(dt) 3.669049 3.732434 4.013946 3.793352 3.972714 5.834608 100 b
# f2(dt) 1.052702 1.085857 1.154132 1.105301 1.138658 2.825464 100 a
2) On your second question, although dt[, sdd := sum(.SD[, 2:4]), by = .I]
does not work, dt[, sdd := sum(.SD[, 2:4]), by = 1:NROW(dt)]
works perfectly. Given that according to ?data.table
".I is an integer vector equal to seq_len(nrow(x))", one might expect these to be equivalent. The difference, however, is that .I
is for use in j
, not in by
. NB the value of .I
is calculated internally in data.table, so is not available beforehand to be passed in as a parameter value as in by=.I
.
It might also be expected that by = .I
should just throw an error. But this does not occur, because loading the data.table
package creates an object .I
in the data.table namespace that is accessible from the global environment, and whose value is NULL
. You can test this by typing .I
at the command prompt. (Note, the same applies to .SD
, .EACHI
, .N
, .GRP
, and .BY
)
.I
# Error: object '.I' not found
library(data.table)
.I
# NULL
data.table::.I
# NULL
The upshot of this is that the behaviour of by = .I
is equivalent to by = NULL
.
3) Although we have already seen in part 1 that in the case of rowSums
, which already loops row-wise efficiently, there are much faster ways than creating the rowpos column. But what about looping when we don't have a fast row-wise function?
Benchmarking the by = rowpos
and by = 1:NROW(dt)
versions against a for
loop with set()
is informative here. We find that looping over set
in a for
loop is slower than either of the methods that use data.table's by
argument for looping. However there is neglibible difference in timing between the by
loop that creates an additional column and the one that uses seq_len(NROW(dt))
. Absent any performance difference, it seems that f.nrow
is probably preferable, but only on the basis of being more concise and not creating an unnecessary column
dt <- data.table(V0 = rep(LETTERS[c(1,1,2,2,3)], 1e3), V1=1:5, V2=3:7, V3=5:1)
f.rowpos <- function() {
dt[, rowpos := .I]
dt[, sdd := sum(.SD[, 2:4]), by = rowpos ]
}
f.nrow <- function() {
dt[, sdd := sum(.SD[, 2:4]), by = seq_len(NROW(dt)) ]
}
f.forset<- function() {
for (i in seq_len(NROW(dt))) set(dt, i, 'sdd', sum(dt[i, 2:4]))
}
microbenchmark(f.rowpos(),f.nrow(), f.forset(), times = 5)
# Unit: milliseconds
# expr min lq mean median uq max neval
# f.rowpos() 559.1115 575.3162 580.2853 578.6865 588.5532 599.7591 5
# f.nrow() 558.4327 582.4434 584.6893 587.1732 588.6689 606.7282 5
# f.forset() 1172.6560 1178.8399 1298.4842 1255.4375 1292.7393 1592.7486 5
So, in conclusion, even in situations where there is not an optimised function such as rowSums
that already operates by row, there are alternatives to using a rowpos column that, although not faster, don't require creation of a redundant column.