Find indices of duplicated rows [duplicate]
The function duplicated in R searches for duplicate rows. To remove the duplicates, we just write df[!duplicated(df), ] and they are dropped from the data frame.
But how do I find the indices of the duplicated data? If duplicated returns TRUE for some row, that row is a second (or later) occurrence of a row in the data frame, and its index is easy to obtain. How do I obtain the index of the first occurrence of that row? In other words, the index of the row that the duplicated row is identical to?
I could loop over the data frame, but I think there is a more elegant answer to this question.
Solution 1:
Here's an example:
df <- data.frame(a = c(1,2,3,4,1,5,6,4,2,1))
duplicated(df) | duplicated(df, fromLast = TRUE)
#[1] TRUE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE
How does it work?
The function duplicated(df) flags every row that repeats an earlier row. The argument fromLast = TRUE makes the scan run from the last row backwards, so it flags every row that repeats a later row. The two resulting logical vectors are combined using |, since a TRUE in at least one of them marks a row that has a duplicate somewhere in the data.
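The result above is a logical vector, not a set of indices. A minimal sketch of how it could be turned into row numbers, and how each row could be mapped to the index of its first occurrence, follows. The names dup_idx, key and first_idx are illustrative, and pasting each row into one string assumes the separator does not occur in the data.

```r
df <- data.frame(a = c(1, 2, 3, 4, 1, 5, 6, 4, 2, 1))

# Row indices of every row that has an identical twin somewhere in df
dup_idx <- which(duplicated(df) | duplicated(df, fromLast = TRUE))
dup_idx
#[1]  1  2  4  5  8  9 10

# Collapse each row into a single string so whole rows can be compared.
# Assumption: the separator "\r" never appears in the data itself.
key <- do.call(paste, c(df, sep = "\r"))

# match() returns the position of the first match,
# i.e. the index of the first occurrence of each row
first_idx <- match(key, key)

# Later occurrences paired with the row they duplicate:
# rows 5, 8, 9, 10 duplicate rows 1, 4, 2, 1
data.frame(row = which(duplicated(df)),
           first = first_idx[duplicated(df)])
```

Rows whose first_idx equals their own row number are first occurrences; every other row points back to the row it repeats.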
Solution 2:
If you are using a keyed data.table, then you can use the following elegant syntax:
library(data.table)
DT <- data.table(A = rep(1:3, each = 4),
                 B = rep(1:4, each = 3),
                 C = rep(1:2, 6),
                 key = c("A", "B", "C"))
DT[unique(DT[duplicated(DT)]), which = TRUE]
To unpack:
DT[duplicated(DT)] subsets those rows which are duplicates.
unique(...) returns only the unique combinations of the duplicated rows. This deals with rows that occur more than twice (triplicates and so on).
DT[..., which = TRUE] joins the duplicate rows back to the original table, with which = TRUE returning the row numbers (without which = TRUE it would just return the data).
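The three steps can also be run one at a time to see what each contributes. This is only a sketch on the same example table; the intermediate names dups, dups_unique and idx are illustrative and not part of the original answer.

```r
library(data.table)

DT <- data.table(A = rep(1:3, each = 4),
                 B = rep(1:4, each = 3),
                 C = rep(1:2, 6))
setkey(DT, A, B, C)  # sorts DT by A, B, C and marks those columns as the key

# Step 1: second and later occurrences only (here: two rows)
dups <- DT[duplicated(DT)]

# Step 2: one row per duplicated combination, however often it repeats
dups_unique <- unique(dups)

# Step 3: join back on the key; which = TRUE returns row numbers in DT
idx <- DT[dups_unique, which = TRUE]
idx
#[1]  1  2 11 12
```

The row numbers refer to DT after setkey has sorted it, not to the order in which the rows were created.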
You could also use:
DT[, count := .N, by = list(A, B, C)][count > 1, which = TRUE]
Note that this adds a count column to DT by reference.
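If the extra count column is not wanted, one possible variant uses the special symbol .I, which holds the row numbers of each group, so nothing is written to DT. This is a sketch under the same example data, not part of the original answer; idx is an illustrative name.

```r
library(data.table)

DT <- data.table(A = rep(1:3, each = 4),
                 B = rep(1:4, each = 3),
                 C = rep(1:2, 6),
                 key = c("A", "B", "C"))

# .I holds the row numbers of the current group and .N its size,
# so .I[.N > 1] keeps the row numbers of groups with more than one row
idx <- DT[, .I[.N > 1], by = .(A, B, C)]$V1
idx
#[1]  1  2 11 12

# DT itself is unchanged: still only the columns A, B and C
names(DT)
```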