Compute mean and standard deviation by group for multiple variables in a data.frame
This is an aggregation problem, not a reshaping problem as the question originally suggested: we wish to aggregate each column into a mean and standard deviation by ID. Many packages handle such problems. In base R it can be done with aggregate like this (assuming DF is the input data frame):
ag <- aggregate(. ~ ID, DF, function(x) c(mean = mean(x), sd = sd(x)))
Note 1: A commenter pointed out that ag is a data frame for which some columns are matrices. Although that may seem strange at first, it actually simplifies access. ag has the same number of columns as the input DF. Its first column ag[[1]] is ID, and the ith column of the remainder, ag[[i+1]] (or equivalently ag[-1][[i]]), is the matrix of statistics for the ith input observation column. To access the jth statistic of the ith observation column, one therefore uses ag[[i+1]][, j], which can also be written as ag[-1][[i]][, j].
On the other hand, suppose there are k statistic columns for each observation column in the input (k=2 in the question). If we flatten the output, then to access the jth statistic of the ith observation column we must use the more complex ag[[k*(i-1)+j+1]] or equivalently ag[-1][[k*(i-1)+j]].
For example, compare the simplicity of the first expression vs. the second:
ag[-1][[2]]
## mean sd
## [1,] 36.333 10.2144
## [2,] 32.250 4.1932
## [3,] 43.500 4.9497
ag_flat <- do.call("data.frame", ag) # flatten
ag_flat[-1][, 2 * (2-1) + 1:2]
## Obs_2.mean Obs_2.sd
## 1 36.333 10.2144
## 2 32.250 4.1932
## 3 43.500 4.9497
Note 2: The input in reproducible form is:
Lines <- "ID Obs_1 Obs_2 Obs_3
1 43 48 37
1 27 29 22
1 36 32 40
2 33 38 36
2 29 32 27
2 32 31 35
2 25 28 24
3 45 47 42
3 38 40 36"
DF <- read.table(text = Lines, header = TRUE)
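As a minimal sketch of the access patterns described in Note 1, the matrix columns of ag can also be indexed by name rather than by position. The names Obs_2, mean and sd come from the input data and the anonymous function above; obs2_sd and flat_names are illustrative variable names only, not part of the original answer.

```r
# Rebuild the input from Note 2 so that this block runs on its own
Lines <- "ID Obs_1 Obs_2 Obs_3
1 43 48 37
1 27 29 22
1 36 32 40
2 33 38 36
2 29 32 27
2 32 31 35
2 25 28 24
3 45 47 42
3 38 40 36"
DF <- read.table(text = Lines, header = TRUE)

# One matrix column (mean, sd) per observation column
ag <- aggregate(. ~ ID, DF, function(x) c(mean = mean(x), sd = sd(x)))

# Matrix-column form: pick the observation by name, then the statistic by name
obs2_sd <- ag$Obs_2[, "sd"]

# Flattened form: every statistic becomes its own column named <obs>.<stat>
ag_flat <- do.call("data.frame", ag)
flat_names <- names(ag_flat)
```

With the flattened form the same values are reachable as ag_flat$Obs_2.sd, so name-based access works either way; the difference discussed in Note 1 only matters when indexing by position.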
There are a few different ways to go about it. reshape2 is a helpful package.
Personally, I like using data.table. Below is a step-by-step walkthrough.
If myDF is your data.frame:
library(data.table)
DT <- data.table(myDF)
DT
# this will get you your mean and SD's for each column
DT[, sapply(.SD, function(x) list(mean=mean(x), sd=sd(x)))]
# adding a `by` argument will give you the groupings
DT[, sapply(.SD, function(x) list(mean=mean(x), sd=sd(x))), by=ID]
# If you would like to round the values:
DT[, sapply(.SD, function(x) list(mean=round(mean(x), 3), sd=round(sd(x), 3))), by=ID]
# If we want to add names to the columns
wide <- setnames(
  DT[, sapply(.SD, function(x) list(mean=round(mean(x), 3), sd=round(sd(x), 3))), by=ID],
  c("ID", sapply(names(DT)[-1], paste0, c(".mean", ".SD")))
)
wide
   ID Obs.1.mean Obs.1.SD Obs.2.mean Obs.2.SD Obs.3.mean Obs.3.SD
1: 1 35.333 8.021 36.333 10.214 33.0 9.644
2: 2 29.750 3.594 32.250 4.193 30.5 5.916
3: 3 41.500 4.950 43.500 4.950 39.0 4.243
Also, this overall summary of each column (not grouped by ID) may or may not be helpful:
> DT[, sapply(.SD, summary), .SDcols=names(DT)[-1]]
Obs.1 Obs.2 Obs.3
Min. 25.00 28.00 22.00
1st Qu. 29.00 31.00 27.00
Median 33.00 32.00 36.00
Mean 34.22 36.11 33.22
3rd Qu. 38.00 40.00 37.00
Max. 45.00 48.00 42.00
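A possible variant of the grouped call that avoids the separate setnames step is sketched below. It assumes a reasonably recent data.table and uses a small data frame with Obs_1 ... Obs_3 column names as a stand-in for myDF; the helper name stats and the result name res are hypothetical. The idea is that lapply keeps the column names, and unlist(..., recursive = FALSE) flattens the nested list one level so that the names become <column>.<statistic>.

```r
library(data.table)

# Stand-in for myDF (assumed column names: ID, Obs_1, Obs_2, Obs_3)
myDF <- data.frame(
  ID    = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
  Obs_1 = c(43, 27, 36, 33, 29, 32, 25, 45, 38),
  Obs_2 = c(48, 29, 32, 38, 32, 31, 28, 47, 40),
  Obs_3 = c(37, 22, 40, 36, 27, 35, 24, 42, 36)
)
DT <- as.data.table(myDF)

# Hypothetical helper: returns both statistics as a named list
stats <- function(x) list(mean = mean(x), sd = sd(x))

# lapply() over .SD gives a list of lists; flattening one level yields
# a single named list per group, which data.table turns into columns
res <- DT[, unlist(lapply(.SD, stats), recursive = FALSE), by = ID]
```

Here res should have one row per ID and columns such as Obs_1.mean and Obs_1.sd, so no renaming is needed afterwards.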
Here is probably the simplest way to go about it (with a reproducible example):
library(plyr)
df <- data.frame(ID=rep(1:3, 3), Obs_1=rnorm(9), Obs_2=rnorm(9), Obs_3=rnorm(9))
ddply(df, .(ID), summarize, Obs_1_mean=mean(Obs_1), Obs_1_std_dev=sd(Obs_1),
Obs_2_mean=mean(Obs_2), Obs_2_std_dev=sd(Obs_2))
ID Obs_1_mean Obs_1_std_dev Obs_2_mean Obs_2_std_dev
1 1 -0.13994642 0.8258445 -0.15186380 0.4251405
2 2 1.49982393 0.2282299 0.50816036 0.5812907
3 3 -0.09269806 0.6115075 -0.01943867 1.3348792
EDIT: The following approach saves you a lot of typing when dealing with many columns.
ddply(df, .(ID), colwise(mean))
ID Obs_1 Obs_2 Obs_3
1 1 -0.3748831 0.1787371 1.0749142
2 2 -1.0363973 0.0157575 -0.8826969
3 3 1.0721708 -1.1339571 -0.5983944
ddply(df, .(ID), colwise(sd))
ID Obs_1 Obs_2 Obs_3
1 1 0.8732498 0.4853133 0.5945867
2 2 0.2978193 1.0451626 0.5235572
3 3 0.4796820 0.7563216 1.4404602
Here is a dplyr solution.
set.seed(1)
df <- data.frame(ID=rep(1:3, 3), Obs_1=rnorm(9), Obs_2=rnorm(9), Obs_3=rnorm(9))
library(dplyr)
df %>% group_by(ID) %>% summarise_each(funs(mean, sd))
# ID Obs_1_mean Obs_2_mean Obs_3_mean Obs_1_sd Obs_2_sd Obs_3_sd
# (int) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
# 1 1 0.4854187 -0.3238542 0.7410611 1.1108687 0.2885969 0.1067961
# 2 2 0.4171586 -0.2397030 0.2041125 0.2875411 1.8732682 0.3438338
# 3 3 -0.3601052 0.8195368 -0.4087233 0.8105370 0.3829833 1.4705692
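Note that summarise_each() and funs() have since been deprecated in dplyr. A sketch of the same computation with across(), assuming dplyr 1.0.0 or later, could look like this (res is an illustrative name):

```r
library(dplyr)

set.seed(1)
df <- data.frame(ID = rep(1:3, 3), Obs_1 = rnorm(9), Obs_2 = rnorm(9), Obs_3 = rnorm(9))

# across() applies every function in the named list to every non-grouping
# column; result columns are named <column>_<function>
res <- df %>%
  group_by(ID) %>%
  summarise(across(everything(), list(mean = mean, sd = sd)))
```

Unlike the summarise_each() output above, the columns come out grouped by variable (Obs_1_mean, Obs_1_sd, Obs_2_mean, ...) rather than by statistic.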