Left join using data.table
Suppose I have two data.table's:
A:
A B
1: 1 12
2: 2 13
3: 3 14
4: 4 15
B:
A B
1: 2 13
2: 3 14
and I have the following code:
merge_test = merge(dataA, dataB, by="A", all.data=TRUE)
I get:
A B.x B.y
1: 2 13 13
2: 3 14 14
However, I want all the rows in dataA in the final merged table. Is there a way to do this?
Solution 1:
If you want to add the b
values of B
to A
, then it's best to join A
with B
and update A
by reference as follows:
A[B, on = 'a', bb := i.b]
which gives:
> A a b bb 1: 1 12 NA 2: 2 13 13 3: 3 14 14 4: 4 15 NA
This is a better approach than using B[A, on='a']
because the latter just prints the result to the console. When you want to get the results back into A
, you need to use A <- B[A, on='a']
which will give you the same result.
The reason why A[B, on = 'a', bb := i.b]
is better than A <- B[A, on = 'a']
is memory efficiency. With A[B, on = 'a', bb := i.b]
the location of A
in memory stays the same:
> address(A) [1] "0x102afa5d0" > A[B, on = 'a', bb := i.b] > address(A) [1] "0x102afa5d0"
While on the other hand with A <- B[A, on = 'a']
, a new object is created and saved in memory as A
and hence has another location in memory:
> address(A) [1] "0x102abae50" > A <- B[A, on = 'a'] > address(A) [1] "0x102aa7e30"
Using merge
(merge.data.table
) results in a similar change in memory location:
> address(A) [1] "0x111897e00" > A <- merge(A, B, by = 'a', all.x = TRUE) > address(A) [1] "0x1118ab000"
For memory efficiency it is thus better to use an 'update-by-reference-join' syntax:
A[B, on = 'a', bb := i.b]
Although this doesn't make a noticeable difference with small datasets like these, it does make a difference on large datasets for which data.table
was designed.
Probably also worth mentioning is that the order of A
stays the same.
To see the effect on speed and memory use, let's benchmark with some larger datasets (for data, see the 2nd part of the used data-section below):
library(bench)
bm <- mark(AA <- BB[AA, on = .(aa)],
AA[BB, on = .(aa), cc := cc],
iterations = 1)
which gives (only relevant measurements shown):
> bm[,c(1,3,5)] # A tibble: 2 x 3 expression median mem_alloc <bch:expr> <bch:tm> <bch:byt> 1 AA <- BB[AA, on = .(aa)] 4.98s 4.1GB 2 AA[BB, on = .(aa), `:=`(cc, cc)] 560.88ms 384.6MB
So, in this setup the 'update-by-reference-join' is about 9 times faster and consumes 11 times less memory.
NOTE: Gains in speed and memory use might differ in different setups.
Used data:
# initial datasets
A <- data.table(a = 1:4, b = 12:15)
B <- data.table(a = 2:3, b = 13:14)
# large datasets for the benchmark
set.seed(2019)
AA <- data.table(aa = 1:1e8, bb = sample(12:19, 1e7, TRUE))
BB <- data.table(aa = sample(AA$a, 2e5), cc = sample(2:8, 2e5, TRUE))
Solution 2:
You can try this:
# used data
# set the key in 'B' to the column which you use to join
A <- data.table(a = 1:4, b = 12:15)
B <- data.table(a = 2:3, b = 13:14, key = 'a')
B[A]