.EACHI in data.table?
I cannot seem to find any documentation on what exactly .EACHI
does in data.table
. I see a brief mention of it in the documentation:
Aggregation for a subset of known groups is particularly efficient when passing those groups in i and setting
by=.EACHI
. Wheni
is a data.table,DT[i,j,by=.EACHI]
evaluatesj
for the groups ofDT
that each row ini
joins to. We call this grouping by each i.
But what does "groups" in the context of DT
mean? Is a group determined by the key that is set on DT
? Is the group every distinct row that uses all the columns as the key? I fully understand how to run something like DT[i,j,by=my_grouping_variable]
but am confused as to how .EACHI
would work. Could someone explain please?
Solution 1:
I've added this to the list here. And hopefully we'll be able to deliver as planned.
The reason is most likely that by=.EACHI
is a recent feature (since 1.9.4), but what it does isn't. Let me explain with an example. Suppose we have two data.tables X
and Y
:
X = data.table(x = c(1,1,1,2,2,5,6), y = 1:7, key = "x")
Y = data.table(x = c(2,6), z = letters[2:1], key = "x")
We know that we can join by doing X[Y]
. this is similar to a subset operation, but using data.tables
(instead of integers / row names or logical values). For each row in Y
, taking Y
's key columns, it finds and returns corresponding matching rows in X
's key columns (+ columns in Y
) .
X[Y]
# x y z
# 1: 2 4 b
# 2: 2 5 b
# 3: 6 7 a
Now let's say we'd like to, for each row from Y
's key columns (here only one key column), we'd like to get the count of matches in X
. In versions of data.table
< 1.9.4, we can do this by simply specifying .N
in j
as follows:
# < 1.9.4
X[Y, .N]
# x N
# 1: 2 2
# 2: 6 1
What this implicitly does is, in the presence of j
, evaluate the j-expression
on each matched result of X
(corresponding to the row in Y
). This was called by-without-by or implicit-by, because it's as if there's a hidden by.
The issue was that this'll always perform a by
operation. So, if we wanted to know the number of rows after a join, then we'd have to do: X[Y][ .N]
(or simply nrow(X[Y])
in this case). That is, we can't have the j
expression in the same call if we don't want a by-without-by
. As a result, when we did for example X[Y, list(z)]
, it evaluated list(z)
using by-without-by
and was therefore slightly slower.
Additionally data.table
users requested this to be explicit - see this and this for more context.
Hence by=.EACHI
was added. Now, when we do:
X[Y, .N]
# [1] 3
it does what it's meant to do (avoids confusion). It returns the number of rows resulting from the join.
And,
X[Y, .N, by=.EACHI]
evaluates j
-expression on the matching rows for each row in Y
(corresponding to value from Y
's key columns here). It'd be easier to see this by using which=TRUE
.
X[.(2), which=TRUE] # [1] 4 5
X[.(6), which=TRUE] # [1] 7
If we run .N
for each, then we should get 2,1.
X[Y, .N, by=.EACHI]
# x N
# 1: 2 2
# 2: 6 1
So we now have both functionalities. Hope this helps.