Aggregate / summarize multiple variables per group (e.g. sum, mean)
From a data frame, is there a easy way to aggregate (sum
, mean
, max
et c) multiple variables simultaneously?
Below are some sample data:
library(lubridate)
days = 365*2
date = seq(as.Date("2000-01-01"), length = days, by = "day")
year = year(date)
month = month(date)
x1 = cumsum(rnorm(days, 0.05))
x2 = cumsum(rnorm(days, 0.05))
df1 = data.frame(date, year, month, x1, x2)
I would like to simultaneously aggregate the x1
and x2
variables from the df2
data frame by year and month. The following code aggregates the x1
variable, but is it also possible to simultaneously aggregate the x2
variable?
### aggregate variables by year month
df2=aggregate(x1 ~ year+month, data=df1, sum, na.rm=TRUE)
head(df2)
Solution 1:
Yes, in your formula
, you can cbind
the numeric variables to be aggregated:
aggregate(cbind(x1, x2) ~ year + month, data = df1, sum, na.rm = TRUE)
year month x1 x2
1 2000 1 7.862002 -7.469298
2 2001 1 276.758209 474.384252
3 2000 2 13.122369 -128.122613
...
23 2000 12 63.436507 449.794454
24 2001 12 999.472226 922.726589
See ?aggregate
, the formula
argument and the examples.
Solution 2:
With the dplyr
package, you can use summarise_all
, summarise_at
or summarise_if
functions to aggregate multiple variables simultaneously. For the example dataset you can do this as follows:
library(dplyr)
# summarising all non-grouping variables
df2 <- df1 %>% group_by(year, month) %>% summarise_all(sum)
# summarising a specific set of non-grouping variables
df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(x1, x2), sum)
df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(-date), sum)
# summarising a specific set of non-grouping variables using select_helpers
# see ?select_helpers for more options
df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(starts_with('x')), sum)
df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(matches('.*[0-9]')), sum)
# summarising a specific set of non-grouping variables based on condition (class)
df2 <- df1 %>% group_by(year, month) %>% summarise_if(is.numeric, sum)
The result of the latter two options:
year month x1 x2
<dbl> <dbl> <dbl> <dbl>
1 2000 1 -73.58134 -92.78595
2 2000 2 -57.81334 -152.36983
3 2000 3 122.68758 153.55243
4 2000 4 450.24980 285.56374
5 2000 5 678.37867 384.42888
6 2000 6 792.68696 530.28694
7 2000 7 908.58795 452.31222
8 2000 8 710.69928 719.35225
9 2000 9 725.06079 914.93687
10 2000 10 770.60304 863.39337
# ... with 14 more rows
Note: summarise_each
is deprecated in favor of summarise_all
, summarise_at
and summarise_if
.
As mentioned in my comment above, you can also use the recast
function from the reshape2
-package:
library(reshape2)
recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))
which will give you the same result.
Solution 3:
Using the data.table
package, which is fast (useful for larger datasets)
https://github.com/Rdatatable/data.table/wiki
library(data.table)
df2 <- setDT(df1)[, lapply(.SD, sum), by=.(year, month), .SDcols=c("x1","x2")]
setDF(df2) # convert back to dataframe
Using the plyr package
require(plyr)
df2 <- ddply(df1, c("year", "month"), function(x) colSums(x[c("x1", "x2")]))
Using summarize() from the Hmisc package (column headings are messy in my example though)
# need to detach plyr because plyr and Hmisc both have a summarize()
detach(package:plyr)
require(Hmisc)
df2 <- with(df1, summarize( cbind(x1, x2), by=llist(year, month), FUN=colSums))
Solution 4:
Where is this year()
function from?
You could also use the reshape2
package for this task:
require(reshape2)
df_melt <- melt(df1, id = c("date", "year", "month"))
dcast(df_melt, year + month ~ variable, sum)
# year month x1 x2
1 2000 1 -80.83405 -224.9540159
2 2000 2 -223.76331 -288.2418017
3 2000 3 -188.83930 -481.5601913
4 2000 4 -197.47797 -473.7137420
5 2000 5 -259.07928 -372.4563522