Why is allow.cartesian sometimes required when joining data.tables with duplicate keys?
I am trying to understand the logic of J() lookup when there're duplicate keys in a data.table in R.
Here's a little experiment I have tried:
library(data.table)
options(stringsAsFactors = FALSE)
x <- data.table(keyVar = c("a", "b", "c", "c"),
value = c( 1, 2, 3, 4))
setkey(x, keyVar)
y1 <- data.frame(name = c("d", "c", "a"))
x[J(y1$name), ]
## OK
y2 <- data.frame(name = c("d", "c", "a", "b"))
x[J(y2$name), ]
## Error: see below
x2 <- data.table(keyVar = c("a", "b", "c"),
value = c( 1, 2, 3))
setkey(x2, keyVar)
x2[J(y2$name), ]
## OK
The error message I am getting is:
Error in vecseq(f__, len__, if (allow.cartesian) NULL else as.integer(max(nrow(x), :
Join results in 5 rows; more than 4 = max(nrow(x),nrow(i)). Check for duplicate key
values in i, each of which join to the same group in x over and over again. If that's
ok, try including `j` and dropping `by` (by-without-by) so that j runs for each group
to avoid the large allocation. If you are sure you wish to proceed, rerun with
allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki,
Stack Overflow and datatable-help for advice.
I don't really understand this. I know I should avoid duplicate keys in a lookup function, I just want to gain some insight so I won't make any error in the future.
Thanks a ton for help. This is a great tool.
You don't have to avoid duplicate keys. As long as the result does not get bigger than max(nrow(x), nrow(i)), you won't get this error, even if you have duplicates. It is basically a precautionary measure.
When you have duplicate keys, the resulting join can sometimes get much bigger. Since data.table knows the total number of rows that will result from this join early enough, it raises this error and asks you to pass allow.cartesian=TRUE if you're really sure.
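To see where the 5 and the 4 in the question's error message come from, the row count of the join can be reproduced by hand. This is only a sketch in base R: join_rows is a hypothetical helper, not part of data.table, and it assumes the default nomatch = NA behaviour, where a value in i with no match in x still yields one row (filled with NA).

```r
# Key column of x from the question (note the duplicated "c").
x_keys <- c("a", "b", "c", "c")

# Hypothetical helper: number of rows the join x[J(i_keys)] would produce.
join_rows <- function(x_keys, i_keys) {
  # Count how many rows of x each value in i matches.
  matches <- vapply(i_keys, function(k) sum(x_keys == k), integer(1))
  # An unmatched value still contributes one NA row (assumes nomatch = NA).
  sum(pmax(matches, 1L))
}

y1 <- c("d", "c", "a")
y2 <- c("d", "c", "a", "b")

join_rows(x_keys, y1)            # 1 + 2 + 1 = 4
max(length(x_keys), length(y1))  # 4, so the limit is not exceeded

join_rows(x_keys, y2)            # 1 + 2 + 1 + 1 = 5
max(length(x_keys), length(y2))  # 4, so 5 > 4 triggers the error
```

With x2, which has no duplicated "c", the second lookup yields only 4 rows, which is why it goes through.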
Here's an (exaggerated) example that illustrates the idea behind this error message:
require(data.table)
DT1 <- data.table(x=rep(letters[1:2], c(1e2, 1e7)),
y=1L, key="x")
DT2 <- data.table(x=rep("b", 3), key="x")
# not run
# DT1[DT2] ## error
dim(DT1[DT2, allow.cartesian=TRUE])
# [1] 30000000 2
The duplicates in DT2 produced 3 times the number of "b" rows in DT1 (1e7), i.e. 3e7 rows. Imagine performing the join with 1e4 values in DT2: the result would explode! To guard against this, there's the allow.cartesian argument, which is FALSE by default.
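A scaled-down version of the same join is cheap enough to run and makes the multiplication visible. The sketch below uses a small stand-in table (2 "a" rows and 5 "b" rows, sizes chosen purely for illustration) and also shows the alternative the error message hints at: running j once per row of i instead of materialising the whole join. It assumes a data.table version that supports by = .EACHI (1.9.4 or later); the column name total is made up for the example.

```r
library(data.table)

# Small stand-in for the large table: 2 "a" rows and 5 "b" rows.
DT1 <- data.table(x = rep(c("a", "b"), c(2, 5)), y = 1L, key = "x")
# Three duplicate "b" keys in i.
DT2 <- data.table(x = rep("b", 3), key = "x")

# Each of the 3 rows in DT2 matches all 5 "b" rows in DT1.
full <- DT1[DT2, allow.cartesian = TRUE]
nrow(full)   # 15, more than max(nrow(DT1), nrow(DT2)) = 7

# Aggregate per row of i so the full join is never built.
per_i <- DT1[DT2, .(total = sum(y)), by = .EACHI]
per_i
```

If only a summary per lookup row is needed, the second form sidesteps the large allocation altogether.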
That being said, I think Matt once mentioned that it may be possible to raise the error only for "large" joins (that is, joins that result in a huge number of rows, with the threshold set somewhat arbitrarily, I guess). If and when that is implemented, joins that don't explode combinatorially will go through without this error message.