Subset with unique cases, based on multiple columns
I'd like to subset a dataframe to include only rows that have unique combinations of three columns. My situation is similar to the one presented in this question, but I'd like to preserve the other columns in my data as well. Here's my example:
> df
v1 v2 v3 v4 v5
1 7 1 A 100 98
2 7 2 A 98 97
3 8 1 C NA 80
4 8 1 C 78 75
5 8 1 C 50 62
6 9 3 C 75 75
The requested output would be something like this, where I'm looking for unique cases based on v1, v2, and v3 only:
> df.new
v1 v2 v3 v4 v5
1 7 1 A 100 98
2 7 2 A 98 97
3 8 1 C NA 80
6 9 3 C 75 75
If I could recover the non-unique rows that would be great too:
> df.dupes
v1 v2 v3 v4 v5
3 8 1 C NA 80
4 8 1 C 78 75
5 8 1 C 50 62
I saw a related question for how to do this in sql (here), but I can't get this in R. I'm sure it's simple but messing with unique() and subset() hasn't been fruitful. Thanks in advance.
Solution 1:
You can use the duplicated()
function to find the unique combinations:
> df[!duplicated(df[1:3]),]
v1 v2 v3 v4 v5
1 7 1 A 100 98
2 7 2 A 98 97
3 8 1 C NA 80
6 9 3 C 75 75
To get only the duplicates, you can check it in both directions:
> df[duplicated(df[1:3]) | duplicated(df[1:3], fromLast=TRUE),]
v1 v2 v3 v4 v5
3 8 1 C NA 80
4 8 1 C 78 75
5 8 1 C 50 62
Solution 2:
Using dplyr
you could do:
library(dplyr)
# distinct
df %>%
distinct(v1, v2, v3, .keep_all = T)
# non-distinct only
df %>%
group_by(v1, v2, v3) %>%
filter(n() > 1)
# exclude any non-distinct
df %>%
group_by(v1, v2, v3) %>%
filter(n() == 1)