Joining aggregated values back to the original data frame [duplicate]
Solution 1:
One line of code does the trick:
new <- ddply( df, "group1", transform, numcolwise(mean))
new
group1 group2 values meanValue
1 1 A 0.48742905 -0.121033381
2 1 A -0.04493361 -0.121033381
3 1 C -0.62124058 -0.121033381
4 1 C -0.30538839 -0.121033381
5 2 A 1.51178117 0.004803931
6 2 B 0.73832471 0.004803931
7 2 A -0.01619026 0.004803931
8 2 B -2.21469989 0.004803931
9 3 B 1.12493092 0.758597929
10 3 C 0.38984324 0.758597929
11 3 B 0.57578135 0.758597929
12 3 A 0.94383621 0.758597929
identical(df, new)
[1] TRUE
Solution 2:
I think ave()
is more useful here than the plyr call you show (I'm not familiar enough with plyr to know if you can do what you want with plyr directly or not, I would be surprised if you can't!) or the other base R alternatives (aggregate()
, tapply()
).:
> with(df, ave(values, group1, FUN = mean))
[1] -0.121033381 0.004803931 0.758597929 -0.121033381 0.004803931
[6] 0.758597929 -0.121033381 0.004803931 0.758597929 -0.121033381
[11] 0.004803931 0.758597929
You can use within()
or transform()
to embed this result directly into df
:
> df2 <- within(df, meanValue <- ave(values, group1, FUN = mean))
> head(df2)
group1 group2 values meanValue
1 1 A 0.4874291 -0.121033381
2 2 B 0.7383247 0.004803931
3 3 B 0.5757814 0.758597929
4 1 C -0.3053884 -0.121033381
5 2 A 1.5117812 0.004803931
6 3 C 0.3898432 0.758597929
> df3 <- transform(df, meanValue = ave(values, group1, FUN = mean))
> all.equal(df2,df3)
[1] TRUE
And if the ordering is important:
> head(df2[order(df2$group1, df2$group2), ])
group1 group2 values meanValue
1 1 A 0.48742905 -0.121033381
10 1 A -0.04493361 -0.121033381
4 1 C -0.30538839 -0.121033381
7 1 C -0.62124058 -0.121033381
5 2 A 1.51178117 0.004803931
11 2 A -0.01619026 0.004803931
Solution 3:
In terms of performance, you can do this same kind of operation using the data.table
package, which has built in aggregation and is very fast thanks to indices and a C based implementation. For instance, given df
already exists from your example:
library("data.table")
dt<-as.data.table(df)
setkey(dt,group1)
dt<-dt[,list(group2,values,meanValue=mean(values)),by=group1]
dt
group1 group2 values meanValue
[1,] 1 A 0.82122120 0.18810771
[2,] 1 C 0.78213630 0.18810771
[3,] 1 C 0.61982575 0.18810771
[4,] 1 A -1.47075238 0.18810771
[5,] 2 B 0.59390132 0.03354688
[6,] 2 A 0.07456498 0.03354688
[7,] 2 B -0.05612874 0.03354688
[8,] 2 A -0.47815006 0.03354688
[9,] 3 B 0.91897737 -0.20205707
[10,] 3 C -1.98935170 -0.20205707
[11,] 3 B -0.15579551 -0.20205707
[12,] 3 A 0.41794156 -0.20205707
I have not benchmarked it, but in my experience it is a lot faster.
If you decide to go down the data.table road, which I think is worth exploring if you work with large data sets, you really need to read the docs because there are some differences from data frame that can bite you if you are unaware of them. However, notably data.table generally does work with any function expecting a data frame,as a data.table will claim its type is data frame (data table inherits from data frame).
[ Feb 2011 ]
[ Aug 2012 ] Update from Matthew :
New in v1.8.2 released to CRAN in July 2012 is :=
by group. This is very similar to the answer above, but adds the new column by reference to dt
so there is no copy and no need for a merge step or relisting existing columns to return alongside the aggregate. There is no need to setkey
first, and it copes with non-contiguous groups (i.e. groups that aren't grouped together).
This is signficantly faster for large datasets, and has a simple and short syntax :
dt <- as.data.table(df)
dt[, meanValue := mean(values), by = group1]
Solution 4:
Can't you just add x
to the function you pass to ddply
?
df <- ddply( df, "group1", function(x)
data.frame( x, meanValue = mean(x$values) ) )