Remove duplicates based on 2nd column condition
I am trying to remove duplicate rows from a data frame based on the max value on a different column
So, for the data frame:
df<-data.frame (rbind(c("a",2,3),c("a",3,4),c("a",3,5),c("b",1,3),c("b",2,6),c("r",4,5))
colnames(df)<-c("id","val1","val2")
id val1 val2
a 2 3
a 3 4
a 3 5
b 1 3
b 2 6
r 4 5
I would like to keep remove all duplicates by id with the condition that for the corresponding rows they do not have the maximum value for val2.
Thus the data frame should become:
a 3 5
b 2 6
r 4 5
-> remove all a duplicates but keep row with the max value for df$val2 for subset(df, df$id=="a")
Using base R
. Here, the columns are factors
. Make sure to convert it to numeric
df$val2 <- as.numeric(as.character(df$val2))
df[with(df, ave(val2, id, FUN=max)==val2),]
# id val1 val2
#3 a 3 5
#5 b 2 6
#6 r 4 5
Or using dplyr
library(dplyr)
df %>%
group_by(id) %>%
filter(val2==max(val2))
# id val1 val2
#1 a 3 5
#2 b 2 6
#3 r 4 5
One possible way is to use data.table
library(data.table)
setDT(df)[, .SD[which.max(val2)], by = id]
## id val1 val2
## 1: a 3 5
## 2: b 2 6
## 3: r 4 5