Fast vectorized merge of list of data.frames by row
Most of the questions on SO about merging data.frames in lists don't quite match what I'm trying to do here, but feel free to prove me wrong.
I have a list of data.frames. I would like to "rbind" their rows by position: all first rows form one data.frame, all second rows a second data.frame, and so on. The result would be a list whose length equals the number of rows in my original data.frame(s). So far, the data.frames are identical in dimensions.
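To make the target shape concrete, here is a minimal sketch with two tiny data.frames (the names `a` and `b` are invented for illustration):

```r
# two 2-row data.frames
a <- data.frame(x = 1:2, y = 3:4)
b <- data.frame(x = 5:6, y = 7:8)

# desired result: a list of 2 data.frames, one per row position;
# element 1 binds the first rows of a and b, element 2 the second rows
list(rbind(a[1, ], b[1, ]), rbind(a[2, ], b[2, ]))
```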
Here's some data to play around with.
sample.list <- list(data.frame(x = sample(1:100, 10), y = sample(1:100, 10), capt = sample(0:1, 10, replace = TRUE)),
                    data.frame(x = sample(1:100, 10), y = sample(1:100, 10), capt = sample(0:1, 10, replace = TRUE)),
                    data.frame(x = sample(1:100, 10), y = sample(1:100, 10), capt = sample(0:1, 10, replace = TRUE)),
                    data.frame(x = sample(1:100, 10), y = sample(1:100, 10), capt = sample(0:1, 10, replace = TRUE)),
                    data.frame(x = sample(1:100, 10), y = sample(1:100, 10), capt = sample(0:1, 10, replace = TRUE)),
                    data.frame(x = sample(1:100, 10), y = sample(1:100, 10), capt = sample(0:1, 10, replace = TRUE)),
                    data.frame(x = sample(1:100, 10), y = sample(1:100, 10), capt = sample(0:1, 10, replace = TRUE)))
Here's what I've come up with with the good ol' for loop.
# solution 1
my.list <- vector("list", nrow(sample.list[[1]]))
for (i in 1:nrow(sample.list[[1]])) {
  for (j in 1:length(sample.list)) {
    my.list[[i]] <- rbind(my.list[[i]], sample.list[[j]][i, ])
  }
}
# solution 2 (so far my favorite)
sample.list2 <- do.call("rbind", sample.list)
my.list2 <- vector("list", nrow(sample.list[[1]]))
for (i in 1:nrow(sample.list[[1]])) {
  my.list2[[i]] <- sample.list2[seq(from = i, to = nrow(sample.list2), by = nrow(sample.list[[1]])), ]
}
Can this be improved using vectorization without much brainhurt? The correct answer will contain a snippet of code, of course; "yes" as an answer doesn't count.
EDIT
# solution 3 (a variant of solution 2 above)
ind <- rep(1:nrow(sample.list[[1]]), times = length(sample.list))
my.list3 <- split(x = sample.list2, f = ind)
BENCHMARKING
I've made my list larger, with more rows per data.frame, and benchmarked the solutions. The results are as follows:
# solution 1
system.time(for (i in 1:nrow(sample.list[[1]])) {
  for (j in 1:length(sample.list)) {
    my.list[[i]] <- rbind(my.list[[i]], sample.list[[j]][i, ])
  }
})
user system elapsed
80.989 0.004 81.210
# solution 2
system.time(for (i in 1:nrow(sample.list[[1]])) {
  my.list2[[i]] <- sample.list2[seq(from = i, to = nrow(sample.list2), by = nrow(sample.list[[1]])), ]
})
user system elapsed
0.957 0.160 1.126
# solution 3
system.time(split(x = sample.list2, f = ind))
user system elapsed
1.104 0.204 1.332
# solution Gabor
system.time(lapply(1:nr, bind.ith.rows))
user system elapsed
0.484 0.000 0.485
# solution ncray
system.time(alply(do.call("cbind", sample.list), 1,
                  .fun = matrix, ncol = ncol(sample.list[[1]]), byrow = TRUE,
                  dimnames = list(1:length(sample.list), names(sample.list[[1]]))))
user system elapsed
11.296 0.016 11.365
Solution 1:
Try this:
bind.ith.rows <- function(i) do.call(rbind, lapply(sample.list, "[", i, TRUE))
nr <- nrow(sample.list[[1]])
lapply(1:nr, bind.ith.rows)
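Applied to the sample data above (a list of 7 data.frames with 10 rows each), this returns a list of 10 data.frames, each holding one row from every list element:

```r
result <- lapply(1:nr, bind.ith.rows)
length(result)    # 10: one list element per row position
dim(result[[1]])  # 7 x 3: one row from each of the 7 data.frames
```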
Solution 2:
A couple of solutions that will make this quicker using data.table.

EDIT: with a larger dataset, showing data.table's awesomeness even more.
# here are some sample data
sample.list <- replicate(10000, data.frame(x = sample(1:100, 10),
                                           y = sample(1:100, 10),
                                           capt = sample(0:1, 10, replace = TRUE)),
                         simplify = FALSE)
Gabor's fast solution:
# Solution Gabor
bind.ith.rows <- function(i) do.call(rbind, lapply(sample.list, "[", i, TRUE))
nr <- nrow(sample.list[[1]])
system.time(rowbound <- lapply(1:nr, bind.ith.rows))
## user system elapsed
## 25.87 0.01 25.92
The data.table function rbindlist will make this even quicker, even when working with data.frames:
library(data.table)
fastbind.ith.rows <- function(i) rbindlist(lapply(sample.list, "[", i, TRUE))
system.time(fastbound <- lapply(1:nr, fastbind.ith.rows))
## user system elapsed
## 13.89 0.00 13.89
A data.table solution

Here is a solution that uses data.tables; it is the split solution on steroids.
# data.table solution
system.time({
  # change each element of sample.list to a data.table (and data.frame);
  # this is done instantaneously by reference
  invisible(lapply(sample.list, setattr, name = "class",
                   value = c("data.table", "data.frame")))
  # combine into one big data set
  bigdata <- rbindlist(sample.list)
  # add a row index column (by reference)
  index <- as.character(seq_len(nr))
  bigdata[, `:=`(rowid, index)]
  # set the key for binary searches
  setkey(bigdata, rowid)
  # split on this
  dt_list <- lapply(index, function(i, j, x) x[i = J(i)], x = bigdata)
  # if you want to drop the `rowid` column
  invisible(lapply(dt_list, function(x) set(x, j = "rowid", value = NULL)))
  # if you really don't want them to be data.tables, run this line
  # invisible(lapply(dt_list, setattr, name = 'class', value = c('data.frame')))
})
## user system elapsed
## 0.08 0.00 0.08
How awesome is data.table!
Caveat user with rbindlist

rbindlist is fast because it does not perform the checking that do.call(rbind, ...) does. For example, it assumes that any factor columns have the same levels as those in the first element of the list.
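A quick sketch of the pitfall (behaviour as in data.table versions current at the time of writing; later releases handle mismatched factor levels properly):

```r
library(data.table)
df1 <- data.frame(f = factor(c("a", "b")))
df2 <- data.frame(f = factor(c("x", "y")))

# do.call(rbind, ...) merges the level sets, so "x" and "y" survive
do.call(rbind, list(df1, df2))$f

# older rbindlist kept only the underlying integer codes and reinterpreted
# them against the levels of the first element, silently mangling df2's values
rbindlist(list(df1, df2))$f
```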