Finding ALL duplicate rows, including "elements with smaller subscripts"
R's duplicated returns a vector showing whether each element of a vector or data frame is a duplicate of an element with a smaller subscript. So if rows 3, 4, and 5 of a 5-row data frame are the same, duplicated will give me the vector
FALSE, FALSE, FALSE, TRUE, TRUE
But in this case I actually want to get
FALSE, FALSE, TRUE, TRUE, TRUE
that is, I want to know whether a row is duplicated by a row with a larger subscript too.
Solution 1:
duplicated has a fromLast argument. The "Example" section of ?duplicated shows you how to use it. Just call duplicated twice, once with fromLast=FALSE and once with fromLast=TRUE, and take the rows where either is TRUE.
A late edit: you didn't provide a reproducible example, so here's an illustration kindly contributed by @jbaums:
vec <- c("a", "b", "c","c","c")
vec[duplicated(vec) | duplicated(vec, fromLast=TRUE)]
## [1] "c" "c" "c"
Edit: And an example for the case of a data frame:
df <- data.frame(rbind(c("a","a"),c("b","b"),c("c","c"),c("c","c")))
df[duplicated(df) | duplicated(df, fromLast=TRUE), ]
## X1 X2
## 3 c c
## 4 c c
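The same logical index, negated, keeps only the rows that occur exactly once; continuing with the df just defined:
df[!(duplicated(df) | duplicated(df, fromLast=TRUE)), ]
##   X1 X2
## 1  a  a
## 2  b  b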
Solution 2:
You need to assemble the set of duplicated values, apply unique, and then test with %in%. As always, a sample problem will make this process come alive.
> vec <- c("a", "b", "c","c","c")
> vec[ duplicated(vec)]
[1] "c" "c"
> unique(vec[ duplicated(vec)])
[1] "c"
> vec %in% unique(vec[ duplicated(vec)])
[1] FALSE FALSE TRUE TRUE TRUE
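The same unique-plus-%in% idea can be stretched to whole data frame rows by first collapsing each row into a single key string; here's a sketch, where the "\r" separator is an arbitrary choice assumed not to occur in the data:
> df <- data.frame(rbind(c("a","a"), c("b","b"), c("c","c"), c("c","c")))
> keys <- do.call(paste, c(df, sep = "\r"))  # one key string per row
> df[keys %in% unique(keys[duplicated(keys)]), ]
  X1 X2
3  c  c
4  c  c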
Solution 3:
Duplicated rows in a data frame can be obtained with dplyr by doing
library(tidyverse)
df = bind_rows(iris, head(iris, 20)) # build some test data
df %>% group_by_all() %>% filter(n()>1) %>% ungroup()
To exclude certain columns, group_by_at(vars(-var1, -var2)) can be used instead to group the data. For example:
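As a concrete sketch using the iris-based df from above, grouping on everything except the (hypothetically excluded) Species column, so rows count as duplicates whenever all four measurement columns match:
df %>% group_by_at(vars(-Species)) %>% filter(n() > 1) %>% ungroup()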
If the row indices, and not just the data, are actually needed, you can add them first, as in:
df %>% add_rownames %>% group_by_at(vars(-rowname)) %>% filter(n()>1) %>% pull(rowname)
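add_rownames is deprecated in current dplyr; a sketch of the same pipeline with tibble::rownames_to_column, which should behave equivalently here:
library(tibble)
df %>% rownames_to_column("rowname") %>% group_by_at(vars(-rowname)) %>% filter(n() > 1) %>% pull(rowname)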
Solution 4:
I've had the same question, and if I'm not mistaken, this is also an answer.
df[df$col %in% df[duplicated(df$col), ]$col, ]
Dunno which one is faster, though; the dataset I'm currently using isn't big enough to run tests that produce significant timing differences.
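For completeness, a minimal runnable sketch of this approach; the data frame df and column col are invented for illustration:
df <- data.frame(col = c("a", "b", "c", "c", "c"), id = 1:5)
df[df$col %in% df[duplicated(df$col), ]$col, ]
##   col id
## 3   c  3
## 4   c  4
## 5   c  5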