Apply several summary functions on several variables by group in one call
I have the following data frame
x <- read.table(text = " id1 id2 val1 val2
1 a x 1 9
2 a x 2 4
3 a y 3 5
4 a y 4 9
5 b x 1 7
6 b y 4 4
7 b x 3 9
8 b y 2 8", header = TRUE)
I want to calculate the mean of val1 and val2 grouped by id1 and id2, and simultaneously count the number of rows for each id1-id2 combination. I can perform each calculation separately:
# calculate mean
aggregate(. ~ id1 + id2, data = x, FUN = mean)
# count rows
aggregate(. ~ id1 + id2, data = x, FUN = length)
In order to do both calculations in one call, I tried
do.call("rbind", aggregate(. ~ id1 + id2, data = x, FUN = function(x) data.frame(m = mean(x), n = length(x))))
However, I get a garbled output along with a warning:
# m n
# id1 1 2
# id2 1 1
# 1.5 2
# 2 2
# 3.5 2
# 3 2
# 6.5 2
# 8 2
# 7 2
# 6 2
# Warning message:
# In rbind(id1 = c(1L, 2L, 1L, 2L), id2 = c(1L, 1L, 2L, 2L), val1 = list( :
# number of columns of result is not a multiple of vector length (arg 1)
I could use the plyr package, but my data set is quite large and plyr is very slow (almost unusable) when the size of the dataset grows.
How can I use aggregate or other functions to perform several calculations in one call?
Solution 1:
You can do it all in one step and get proper labeling:
> aggregate(. ~ id1+id2, data = x, FUN = function(x) c(mn = mean(x), n = length(x) ) )
# id1 id2 val1.mn val1.n val2.mn val2.n
# 1 a x 1.5 2.0 6.5 2.0
# 2 b x 2.0 2.0 8.0 2.0
# 3 a y 3.5 2.0 7.0 2.0
# 4 b y 3.0 2.0 6.0 2.0
This creates a dataframe with two id columns and two matrix columns:
str( aggregate(. ~ id1+id2, data = x, FUN = function(x) c(mn = mean(x), n = length(x) ) ) )
'data.frame': 4 obs. of 4 variables:
$ id1 : Factor w/ 2 levels "a","b": 1 2 1 2
$ id2 : Factor w/ 2 levels "x","y": 1 1 2 2
$ val1: num [1:4, 1:2] 1.5 2 3.5 3 2 2 2 2
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "mn" "n"
$ val2: num [1:4, 1:2] 6.5 8 7 6 2 2 2 2
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "mn" "n"
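Since the summaries are stored as matrix columns, individual statistics can be pulled out by column name. The following is a minimal sketch, assuming the question's data is rebuilt with data.frame() and using an illustrative variable name res that is not part of the original answer:

```r
# Rebuild the question's data (same values as the read.table call)
x <- data.frame(
  id1  = c("a", "a", "a", "a", "b", "b", "b", "b"),
  id2  = c("x", "x", "y", "y", "x", "y", "x", "y"),
  val1 = c(1, 2, 3, 4, 1, 4, 3, 2),
  val2 = c(9, 4, 5, 9, 7, 4, 9, 8)
)

# `res` is an illustrative name; FUN returns a named vector of length 2
res <- aggregate(. ~ id1 + id2, data = x,
                 FUN = function(v) c(mn = mean(v), n = length(v)))

is.matrix(res$val1)        # val1 is a 4 x 2 matrix inside the data frame
val1_means <- res$val1[, "mn"]   # group means of val1
val2_counts <- res$val2[, "n"]   # group sizes (same for every value column)
```

Here res$val1 is a single data frame column that happens to hold a matrix, which is why names(res) shows only four names.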
As pointed out by @lord.garbage below, this can be converted to a dataframe with "simple" columns by using do.call(data.frame, ...)
str( do.call(data.frame, aggregate(. ~ id1+id2, data = x, FUN = function(x) c(mn = mean(x), n = length(x) ) ) ) )
'data.frame': 4 obs. of 6 variables:
$ id1 : Factor w/ 2 levels "a","b": 1 2 1 2
$ id2 : Factor w/ 2 levels "x","y": 1 1 2 2
$ val1.mn: num 1.5 2 3.5 3
$ val1.n : num 2 2 2 2
$ val2.mn: num 6.5 8 7 6
$ val2.n : num 2 2 2 2
This is the syntax for multiple variables on the LHS:
aggregate(cbind(val1, val2) ~ id1 + id2, data = x, FUN = function(x) c(mn = mean(x), n = length(x) ) )
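The cbind() form may be handy when the data frame holds columns that should not be summarised. As a rough sketch, assume a hypothetical extra character column note (not in the original data) that mean() could not handle; naming the value columns on the LHS leaves it out, and do.call(data.frame, ...) then flattens the result:

```r
# Question's data plus a hypothetical non-numeric column `note`
x <- data.frame(
  id1  = c("a", "a", "a", "a", "b", "b", "b", "b"),
  id2  = c("x", "x", "y", "y", "x", "y", "x", "y"),
  val1 = c(1, 2, 3, 4, 1, 4, 3, 2),
  val2 = c(9, 4, 5, 9, 7, 4, 9, 8),
  note = letters[1:8]   # assumption: a column we do not want to aggregate
)

# Only val1 and val2 are summarised; `note` is ignored by the formula
res <- aggregate(cbind(val1, val2) ~ id1 + id2, data = x,
                 FUN = function(v) c(mn = mean(v), n = length(v)))

# Flatten the matrix columns into ordinary columns
flat <- do.call(data.frame, res)
names(flat)
```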
Solution 2:
Given this in the question:
I could use the plyr package, but my data set is quite large and plyr is very slow (almost unusable) when the size of the dataset grows.
Then in data.table (1.9.4+) you could try:
> DT
id1 id2 val1 val2
1: a x 1 9
2: a x 2 4
3: a y 3 5
4: a y 4 9
5: b x 1 7
6: b y 4 4
7: b x 3 9
8: b y 2 8
> DT[ , .(mean(val1), mean(val2), .N), by = .(id1, id2)] # simplest
id1 id2 V1 V2 N
1: a x 1.5 6.5 2
2: a y 3.5 7.0 2
3: b x 2.0 8.0 2
4: b y 3.0 6.0 2
> DT[ , .(val1.m = mean(val1), val2.m = mean(val2), count = .N), by = .(id1, id2)] # named
id1 id2 val1.m val2.m count
1: a x 1.5 6.5 2
2: a y 3.5 7.0 2
3: b x 2.0 8.0 2
4: b y 3.0 6.0 2
> DT[ , c(lapply(.SD, mean), count = .N), by = .(id1, id2)] # mean over all columns
id1 id2 val1 val2 count
1: a x 1.5 6.5 2
2: a y 3.5 7.0 2
3: b x 2.0 8.0 2
4: b y 3.0 6.0 2
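The calls above assume DT already exists. A self-contained sketch, assuming the data.table package is installed and building DT directly from the question's values, could look like this; cols and res are illustrative names, and .SDcols is used to state explicitly which columns are averaged:

```r
library(data.table)

# Build DT from the question's values
DT <- data.table(
  id1  = c("a", "a", "a", "a", "b", "b", "b", "b"),
  id2  = c("x", "x", "y", "y", "x", "y", "x", "y"),
  val1 = c(1, 2, 3, 4, 1, 4, 3, 2),
  val2 = c(9, 4, 5, 9, 7, 4, 9, 8)
)

# Columns to summarise (assumption: all of them are numeric)
cols <- c("val1", "val2")

# Mean of each selected column plus the group size, in one call
res <- DT[, c(lapply(.SD, mean), count = .N),
          by = .(id1, id2), .SDcols = cols]
res
```

Groups come back in order of first appearance in DT, so the row order differs from the aggregate() result.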
For timings comparing aggregate (used in the question and all 3 other answers) to data.table, see this benchmark (the agg and agg.x cases).