R: speeding up "group by" operations
I have a simulation that has a huge aggregate-and-combine step right in the middle. I prototyped this process using plyr's ddply() function, which works great for a huge percentage of my needs. But I need this aggregation step to be faster since I have to run 10K simulations. I'm already scaling the simulations in parallel, but if this one step were faster I could greatly decrease the number of nodes I need.
Here's a reasonable simplification of what I am trying to do:
library(plyr)   # for ddply()
library(Hmisc)  # for wtd.mean()
# Set up some example data (1e6 rows)
year <- sample(1970:2008, 1e6, rep=T)
state <- sample(1:50, 1e6, rep=T)
group1 <- sample(1:6, 1e6, rep=T)
group2 <- sample(1:3, 1e6, rep=T)
myFact <- rnorm(100, 15, 1e6)  # 100 draws, recycled to 1e6 rows by data.frame()
weights <- rnorm(1e6)
myDF <- data.frame(year, state, group1, group2, myFact, weights)
# this is the step I want to make faster
system.time(aggregateDF <- ddply(myDF, c("year", "state", "group1", "group2"),
  function(df) wtd.mean(df$myFact, weights=df$weights)
))
All tips or suggestions are appreciated!
Solution 1:
Instead of a normal R data frame, you can use an immutable data frame, which returns pointers to the original when you subset it and can be much faster:
idf <- idata.frame(myDF)
system.time(aggregateDF <- ddply(idf, c("year", "state", "group1", "group2"),
  function(df) wtd.mean(df$myFact, weights=df$weights)))
# user system elapsed
# 18.032 0.416 19.250
If I were to write a plyr function customised exactly to this situation, I'd do something like this:
system.time({
  # integer id for each unique combination of the grouping columns
  ids <- id(myDF[c("year", "state", "group1", "group2")], drop = TRUE)
  # matrix subsetting is much cheaper than data frame subsetting
  data <- as.matrix(myDF[c("myFact", "weights")])
  # row indices belonging to each group
  indices <- plyr:::split_indices(seq_len(nrow(data)), ids, n = attr(ids, "n"))

  fun <- function(rows) {
    weighted.mean(data[rows, 1], data[rows, 2])
  }
  values <- vapply(indices, fun, numeric(1))

  # recover the grouping labels for each id and bind on the results
  labels <- myDF[match(seq_len(attr(ids, "n")), ids),
                 c("year", "state", "group1", "group2")]
  aggregateDF <- cbind(labels, values)
})
# user system elapsed
# 2.04 0.29 2.33
It's so much faster because it avoids copying the data, extracting only the subset needed for each computation at the moment it's computed. Switching the data to matrix form gives another speed boost because matrix subsetting is much faster than data frame subsetting.
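To see that last point in isolation, here is a minimal sketch (my addition, not part of the original answer) that repeatedly extracts a small block of rows from the same two columns stored as a matrix and as a data frame:
m  <- as.matrix(myDF[c("myFact", "weights")])
df <- myDF[c("myFact", "weights")]
rows <- 1:100
system.time(for (i in 1:1e4) m[rows, ])    # matrix row extraction: very cheap
system.time(for (i in 1:1e4) df[rows, ])   # data frame row extraction: much more per-call overhead
The exact numbers depend on your machine, but the data frame version is usually noticeably slower; that per-subset overhead is what the matrix conversion avoids.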
Solution 2:
A further 2x speedup, with more concise code:
library(data.table)
dtb <- data.table(myDF, key="year,state,group1,group2")
system.time(
  res <- dtb[, weighted.mean(myFact, weights), by=list(year, state, group1, group2)]
)
# user system elapsed
# 0.950 0.050 1.007
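As a small usage note (my addition, not part of the original answer): the aggregate comes back in a column named V1; wrapping the expression in list() lets you name it directly:
res <- dtb[, list(wmean = weighted.mean(myFact, weights)),
           by=list(year, state, group1, group2)]
head(res)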
My first post, so please be nice ;)
From data.table v1.9.2, the setDT function is exported, which converts a data.frame to a data.table by reference (in keeping with data.table parlance - all set* functions modify the object by reference). This means no unnecessary copying, which makes it fast. You can time it, but the time will be negligible.
require(data.table)
system.time({
  setDT(myDF)
  res <- myDF[, weighted.mean(myFact, weights),
              by=list(year, state, group1, group2)]
})
# user system elapsed
# 0.970 0.024 1.015
This is as opposed to 1.264 seconds with OP's solution above, where data.table(.) is used to create dtb.
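A quick way to convince yourself of the by-reference behaviour (my addition, using data.table's exported address() helper) is to check that the column data occupy the same memory before and after the conversion:
library(data.table)
DF <- data.frame(x = 1:5, y = rnorm(5))
before <- address(DF$x)      # memory address of the x column's data
setDT(DF)                    # convert to a data.table in place
after  <- address(DF$x)
identical(before, after)     # TRUE: the column was not copied
class(DF)                    # "data.table" "data.frame"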