Efficiently merging two data frames on non-trivial criteria
Answering this question last night, I spent a good hour trying to find a solution that didn't grow a data.frame
in a for loop, without any success, so I'm curious if there's a better way to go about this problem.
The general case of the problem boils down to this:
- Merge two data.frames.
- Entries in either data.frame can have 0 or more matching entries in the other.
- We only care about entries that have 1 or more matches across both.
- The match function is complex, involving multiple columns in both data.frames.
For a concrete example I will use similar data to the linked question:
genes <- data.frame(gene = letters[1:5],
                    chromosome = c(2, 1, 2, 1, 3),
                    start = c(100, 100, 500, 350, 321),
                    end = c(200, 200, 600, 400, 567))

markers <- data.frame(marker = 1:10,
                      chromosome = c(1, 1, 2, 2, 1, 3, 4, 3, 1, 2),
                      position = c(105, 300, 96, 206, 150, 400, 25, 300, 120, 700))
And our complex matching function:
# matching criteria, applies to a single entry from each data.frame
isMatch <- function(marker, gene) {
  return(
    marker$chromosome == gene$chromosome &
      marker$position >= (gene$start - 10) &
      marker$position <= (gene$end + 10)
  )
}
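For example, applied to one row from each data.frame it returns a single logical. Marker 1 (chromosome 1, position 105) falls within the padded range of gene b (chromosome 1, 100 to 200):

# marker 1 lies within 10 of gene b's range on the same chromosome
isMatch(markers[1, ], genes[2, ])
# [1] TRUE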
The output should look like an SQL INNER JOIN of the two data.frames, for entries where isMatch is TRUE. I've tried to construct the two data.frames so that there can be 0 or more matches in the other data.frame.
The solution I came up with is as follows:
joined <- data.frame()
for (i in 1:nrow(genes)) {
  # This repeated subsetting returns the same results as `isMatch` applied across
  # the `markers` data.frame for each entry in `genes`.
  matches <- markers[which(markers$chromosome == genes[i, "chromosome"]), ]
  matches <- matches[which(matches$position >= (genes[i, "start"] - 10)), ]
  matches <- matches[which(matches$position <= (genes[i, "end"] + 10)), ]
  # matches may now be 0 or more rows, which we want to repeat the gene for:
  if (nrow(matches) != 0) {
    joined <- rbind(joined, cbind(genes[i, ], matches[, c("marker", "position")]))
  }
}
Giving the results:
gene chromosome start end marker position
1 a 2 100 200 3 96
2 a 2 100 200 4 206
3 b 1 100 200 1 105
4 b 1 100 200 5 150
5 b 1 100 200 9 120
51 e 3 321 567 6 400
This is quite an ugly and clunky solution, but anything else I tried was met with failure:
- Use of apply gave me a list where each element was a matrix, with no way to rbind them.
- I can't specify the dimensions of joined first, because I don't know how many rows I will need in the end.
I'm sure I will run into a problem of this general form again in the future. So what's the correct way to solve this kind of problem?
Solution 1:
A data.table solution: a rolling join to fulfill the first inequality, followed by a vector scan to satisfy the second inequality. The join-on-first-inequality will have more rows than the final result (and therefore may run into memory issues), but it will be smaller than a straight-up merge (as in Solution 2 below).
require(data.table)

genes_start <- as.data.table(genes)
## create the start bound as a separate column to join to
genes_start[, `:=`(start_bound = start - 10)]
setkey(genes_start, chromosome, start_bound)

markers <- as.data.table(markers)
setkey(markers, chromosome, position)

new <- genes_start[
  ## join genes to markers
  markers,
  ## rolling the last key column of genes_start (start_bound) forward
  ## to match the last key column of markers (position)
  roll = Inf,
  ## inner join
  nomatch = 0
  ## the rolling join leaves the positions column from markers
  ## with the column name from genes_start (start_bound);
  ## now vector scan to fulfill the other criterion
][start_bound <= end + 10]

## change names and column order to match the desired result in the question
setnames(new, "start_bound", "position")
setcolorder(new, c("chromosome", "gene", "start", "end", "marker", "position"))
# chromosome gene start end marker position
# 1: 1 b 100 200 1 105
# 2: 1 b 100 200 9 120
# 3: 1 b 100 200 5 150
# 4: 2 a 100 200 3 96
# 5: 2 a 100 200 4 206
# 6: 3 e 321 567 6 400
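To see what the rolling join does for a single marker, take marker 4 (chromosome 2, position 206): rolling forward matches the gene with the largest start_bound not exceeding 206, and the later vector scan checks the end. A small illustration using the keyed genes_start table built above:

## gene a's start_bound (90) is the last one <= 206 on chromosome 2;
## the vector scan then keeps this row because 206 <= 200 + 10
genes_start[.(2, 206), roll = Inf]
#    gene chromosome start end start_bound
# 1:    a          2   100 200         206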
One could do a double join, but as it involves re-keying the data table before the second join, I don't think that it will be faster than the vector scan solution above.
## make a copy of the genes object and key it by end
genes_end <- as.data.table(genes)
genes_end[, `:=`(end_bound = end + 10, start = NULL, end = NULL)]
setkey(genes_end, chromosome, gene, end_bound)

## as before, wrapped in a similar join (but rolling backwards this time)
new_2 <- genes_end[
  setkey(
    genes_start[
      markers,
      roll = Inf,
      nomatch = 0
    ], chromosome, gene, start_bound),
  roll = -Inf,
  nomatch = 0
]
setnames(new_2, "end_bound", "position")
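As a quick sanity check (a sketch, assuming new and new_2 exist as built above and a data.table version recent enough to have fsetequal), the two approaches should select exactly the same gene/marker pairs:

## compare the selected gene/marker pairs, ignoring row and column order
fsetequal(new[, .(gene, marker)], new_2[, .(gene, marker)])
# [1] TRUE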
Solution 2:
I dealt with a very similar problem myself by doing the merge first and sorting out which rows satisfy the condition afterwards. I don't claim that this is a universal solution; if you're dealing with large datasets where few entries match the condition, it will likely be inefficient. But to adapt it to your data:
joined.raw <- merge(genes, markers)
joined <- joined.raw[joined.raw$position >= (joined.raw$start - 10) &
                     joined.raw$position <= (joined.raw$end + 10), ]
joined
# chromosome gene start end marker position
# 1 1 b 100 200 1 105
# 2 1 b 100 200 5 150
# 4 1 b 100 200 9 120
# 10 2 a 100 200 4 206
# 11 2 a 100 200 3 96
# 16 3 e 321 567 6 400
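For context, merge() with no by argument joins on the column name the two data.frames share, here chromosome, so joined.raw holds every same-chromosome gene/marker pair before the position filter is applied. On the toy data that intermediate table is small:

## all gene/marker pairs sharing a chromosome, prior to filtering
nrow(joined.raw)
# [1] 16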
Solution 3:
Another answer I've come up with, using the sqldf package.
sqldf("SELECT gene, genes.chromosome, start, end, marker, position
FROM genes JOIN markers ON genes.chromosome = markers.chromosome
WHERE position >= (start - 10) AND position <= (end + 10)")
Using microbenchmark, it performs comparably to @alexwhan's merge and [ method.
> microbenchmark(alexwhan, sql)
Unit: nanoseconds
expr min lq median uq max neval
alexwhan 435 462.5 468.0 485 2398 100
sql 422 456.5 466.5 498 1262 100
I've also attempted to test both approaches on some real data of the same format that I have lying around (35,000 rows in genes, 2,000,000 rows in markers, with the joined output coming to 480,000 rows).
Unfortunately merge seems unable to handle this much data, falling over at joined.raw <- merge(genes, markers) with an error (which I don't get if I reduce the number of rows):
Error in merge.data.frame(genes, markers) :
negative length vectors are not allowed
The sqldf method, on the other hand, runs successfully, taking 29 minutes.