Perform a semi-join with data.table

Solution 1:

More possibilities :

w = unique(x[y,which=TRUE])  # the row numbers in x which have a match from y
x[w]

If there are duplicate key values in x, then that needs :

w = unique(x[y,which=TRUE,allow.cartesian=TRUE])
x[w]

Or, the other way around :

setkey(y,x)
w = !is.na(y[x,which=TRUE,mult="first"])
x[w]

If nrow(x) << nrow(y) then the y[x] approach should be faster.
If nrow(x) >> nrow(y) then the x[y] approach should be faster.

But the anti anti join appeals too :-)

Solution 2:

One solution I can think of is:

tmp <- x[!y]
x[!tmp]

In data.table, you can have another data table as an i expression (i.e., the first expression in the data.table.[ call), and that will perform a join, e.g.:

x <- data.table(x = 1:10, y = letters[1:10])
setkey(x, x)
y <- data.table(x = c(1,3,5,1), z = 1:4)

> x[y]
   x y z
1: 1 a 1
2: 3 c 2
3: 5 e 3
4: 1 a 4

The ! before the i expression is an extension of the syntax above that performs a 'not-join', as described on p. 11 of data.table documentation. So the first assignments evaluates to a subset of x that doesn't have any rows where the key (column x) is present in y:

> x[!y]
    x y
1:  2 b
2:  4 d
3:  6 f
4:  7 g
5:  8 h
6:  9 i
7: 10 j

It is similar to setdiff in this regard. And therefore the second statement returns all the rows in x where the key is present in y.

The ! feature was added in data.table 1.8.4 with the following note in NEWS:

o   A new "!" prefix on i signals 'not-join' (a.k.a. 'not-where'), #1384i.
        DT[-DT["a", which=TRUE, nomatch=0]]   # old not-join idiom, still works
        DT[!"a"]                              # same result, now preferred.
        DT[!J(6),...]                         # !J == not-join
        DT[!2:3,...]                          # ! on all types of i
        DT[colA!=6L | colB!=23L,...]          # multiple vector scanning approach (slow)
        DT[!J(6L,23L)]                        # same result, faster binary search
    '!' has been used rather than '-' :
        * to match the 'not-join'/'not-where' nomenclature
        * with '-', DT[-0] would return DT rather than DT[0] and not be backwards
          compatible. With '!', DT[!0] returns DT both before (since !0 is TRUE in
          base R) and after this new feature.
        * to leave DT[+J...] and DT[-J...] available for future use

For some reason, the following doesn't work x[!(x[!y])] - probably data.table is too smart about parsing the argument.

P.S. As Josh O'Brien pointed in another answer, a one-line would be x[!eval(x[!y])].

Solution 3:

I'm confused with all the not-joins above, isn't what you want simply:

unique(x[y, .SD])
#   x y
#1: 1 a

If x can have duplicate keys, then you can unique y instead:

## Creating an example data.table 'a' three-times-repeated first row 
x <- data.table(x = c(1,1,1,2), y = c("a", "a", "a", "b"))
setkey(x, x)
y <- data.table(x = c(1, 1), z = 10:11)
setkey(y, x)

x[eval(unique(y, by = key(y))), .SD] # data.table >= 1.9.8 requires by=key(y)
#    x y
# 1: 1 a
# 2: 1 a
# 3: 1 a