Find rows that are identical in one column but not another

Solution 1:

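I think the easiest way is to call dplyr::count() twice. For your example (using the id_num column from the data below):

library(dplyr)

df %>%
   count(name, id_num) %>%   # one row per distinct (name, id_num) pair
   count(name)               # number of distinct IDs per name

The first count() gives one row per distinct name/ID pair, where n is the number of rows with that pair:

name    id_num  n
bill         5  2
george       1  2
george       2  1
sam          4  1
sara         3  2

The second count() then counts those pairs per name, so n is now the number of distinct IDs for each name:

name    n
bill    1
george  2
sam     1
sara    1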

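To keep only the names with more than one ID, add filter(n > 1) to the end of the pipe. You can also add arrange(desc(n)) to sort by the number of IDs:

df %>%
   count(name, id_num) %>%
   count(name) %>%
   filter(n > 1) %>%        # names linked to more than one ID
   arrange(desc(n))         # most IDs first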
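For readers without dplyr, the same two-step logic can be sketched in base R. This is only an illustration of what the two count() calls do, not a replacement for them; the names pairs, ids_per_name and multi_id are made up for this example, and the data is copied from the Data section at the end.

```r
# Example data, copied from the Data section of this answer.
df <- data.frame(
  name   = c("george", "george", "george", "sara", "sara", "sam", "bill", "bill"),
  id_num = c(1, 1, 2, 3, 3, 4, 5, 5)
)

# Step 1: keep one row per distinct (name, id_num) pair,
# which is what the first count() collapses the data to.
pairs <- unique(df[c("name", "id_num")])

# Step 2: count how many distinct pairs each name has,
# which is what the second count() computes.
ids_per_name <- table(pairs$name)

# Names linked to more than one ID (the filter(n > 1) step).
multi_id <- ids_per_name[ids_per_name > 1]
multi_id
```

With the example data, only george should remain, with a count of 2.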

Solution 2:

Use tapply() to count the distinct IDs per name, then subset to the names that have more than one.

res <- with(df, tapply(id_num, list(name), \(x) length(unique(x))))
res[res > 1]
# george 
#      2 
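Since the question asks for the rows themselves, the flagged names can be used to pull the matching rows out of df. The following is a minimal sketch under the assumption that the data looks like the Data section below; flagged and conflicts are hypothetical names, and function(x) is used instead of \(x) so that it also runs on R versions before 4.1.

```r
# Example data, copied from the Data section of this answer.
df <- data.frame(
  name   = c("george", "george", "george", "sara", "sara", "sam", "bill", "bill"),
  id_num = c(1, 1, 2, 3, 3, 4, 5, 5)
)

# Number of distinct IDs per name, as in the tapply() call above.
res <- with(df, tapply(id_num, name, function(x) length(unique(x))))

# Names that are linked to more than one ID.
flagged <- names(res[res > 1])

# All rows that belong to a flagged name.
conflicts <- df[df$name %in% flagged, ]
conflicts
```

For the example data this should return the three george rows, which share a name but not an ID.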

You probably want to correct this inconsistency (george is linked to two different IDs). A safe way is to rebuild the numeric IDs from the names using as.factor():

df$id_new <- as.integer(as.factor(df$name))
df
#     name id_num id_new
# 1 george      1      2
# 2 george      1      2
# 3 george      2      2
# 4   sara      3      4
# 5   sara      3      4
# 6    sam      4      3
# 7   bill      5      1
# 8   bill      5      1

Here as.factor() assigns the numbers according to the names in alphabetical order. Alternatively, use factor() with explicit levels to number the names in order of appearance:

df$id_new2 <- as.integer(factor(df$name, levels=unique(df$name)))
df
#     name id_num id_new id_new2
# 1 george      1      2       1
# 2 george      1      2       1
# 3 george      2      2       1
# 4   sara      3      4       2
# 5   sara      3      4       2
# 6    sam      4      3       3
# 7   bill      5      1       4
# 8   bill      5      1       4
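After rebuilding the IDs, it may be worth confirming that names and IDs now map one-to-one. A possible check, assuming the id_new2 column created above (pairs, one_id_per_name and one_name_per_id are names invented for this sketch):

```r
# Example data, copied from the Data section of this answer.
df <- data.frame(
  name   = c("george", "george", "george", "sara", "sara", "sam", "bill", "bill"),
  id_num = c(1, 1, 2, 3, 3, 4, 5, 5)
)

# Rebuilt IDs, numbered in order of first appearance.
df$id_new2 <- as.integer(factor(df$name, levels = unique(df$name)))

# Distinct (name, id_new2) pairs. A consistent mapping has no name and
# no ID appearing in more than one pair.
pairs <- unique(df[c("name", "id_new2")])
one_id_per_name <- anyDuplicated(pairs$name) == 0L
one_name_per_id <- anyDuplicated(pairs$id_new2) == 0L

stopifnot(one_id_per_name, one_name_per_id)
```

If either condition fails, stopifnot() raises an error, which makes the check easy to drop into a script.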

Note: the \(x) anonymous-function shorthand used above requires R >= 4.1; on older versions, write function(x) instead.


Data:

df <- structure(list(name = c("george", "george", "george", "sara", 
"sara", "sam", "bill", "bill"), id_num = c(1, 1, 2, 3, 3, 4, 
5, 5)), class = "data.frame", row.names = c(NA, -8L))