Aggregate a dataframe on a given column and display another column
I have a dataframe in R of the following form:
> head(data)
Group Score Info
1 1 1 a
2 1 2 b
3 1 3 c
4 2 4 d
5 2 3 e
6 2 1 f
I would like to aggregate it following the Score
column using the max
function
> aggregate(data$Score, list(data$Group), max)
Group.1 x
1 1 3
2 2 4
But I also would like to display the Info
column associated to the maximum value of the Score
column for each group. I have no idea how to do this. My desired output would be:
Group.1 x y
1 1 3 c
2 2 4 d
Any hint?
A base R solution is to combine the output of aggregate()
with a merge()
step. I find the formula interface to aggregate()
a little more useful than the standard interface, partly because the names on the output are nicer, so I'll use that:
The aggregate()
step is
maxs <- aggregate(Score ~ Group, data = dat, FUN = max)
and the merge()
step is simply
merge(maxs, dat)
This gives us the desired output:
R> maxs <- aggregate(Score ~ Group, data = dat, FUN = max)
R> merge(maxs, dat)
Group Score Info
1 1 3 c
2 2 4 d
You could, of course, stick this into a one-liner (the intermediary step was more for exposition):
merge(aggregate(Score ~ Group, data = dat, FUN = max), dat)
The main reason I used the formula interface is that it returns a data frame with the correct names
for the merge step; these are the names of the columns from the original data set dat
. We need to have the output of aggregate()
have the correct names so that merge()
knows which columns in the original and aggregated data frames match.
The standard interface gives odd names, whichever way you call it:
R> aggregate(dat$Score, list(dat$Group), max)
Group.1 x
1 1 3
2 2 4
R> with(dat, aggregate(Score, list(Group), max))
Group.1 x
1 1 3
2 2 4
We can use merge()
on those outputs, but we need to do more work telling R which columns match up.
First, you split the data using split
:
split(z,z$Group)
Than, for each chunk, select the row with max Score:
lapply(split(z,z$Group),function(chunk) chunk[which.max(chunk$Score),])
Finally reduce back to a data.frame do.call
ing rbind
:
do.call(rbind,lapply(split(z,z$Group),function(chunk) chunk[which.max(chunk$Score),]))
Result:
Group Score Info
1 1 3 c
2 2 4 d
One line, no magic spells, fast, result has good names =)
Here is a solution using the plyr
package.
The following line of code essentially tells ddply
to first group your data by Group, and then within each group returns a subset where the Score equals the maximum score in that group.
library(plyr)
ddply(data, .(Group), function(x)x[x$Score==max(x$Score), ])
Group Score Info
1 1 3 c
2 2 4 d
And, as @SachaEpskamp points out, this can be further simplified to:
ddply(df, .(Group), function(x)x[which.max(x$Score), ])
(which also has the advantage that which.max
will return multiple max lines, if there are any).