Find rows that are identical in one column but not another
Solution 1:
I think the easiest way will be to use dplyr::count
twice, hence for your example:
df %>%
count(name, id) %>%
count(name)
The first count will give:
name id n
george 1 2
george 2 1
sara 3 2
sam 4 1
bill 5 2
Then the second count will give:
name n
george 2
sara 1
sam 1
bill 1
Of course, you could add filter(n > 1)
to the end of your pipe, too, or arrange(desc(n))
df %>%
count(name, id) %>%
count(name) %>%
arrange(desc(n)) %>%
filter(n > 1)
Solution 2:
Using tapply()
to calculate number of ID's per name, then subset for greater than 1.
res <- with(df, tapply(id_num, list(name), \(x) length(unique(x))))
res[res > 1]
# george
# 2
You probably want to correct this. A safe way is to rebuild the numeric ID's using as.factor()
,
df$id_new <- as.integer(as.factor(df$name))
df
# name id_num id_new
# 1 george 1 2
# 2 george 1 2
# 3 george 2 2
# 4 sara 3 4
# 5 sara 3 4
# 6 sam 4 3
# 7 bill 5 1
# 8 bill 5 1
where numbers are assigned according to the names in alphabetical order, or factor()
, reading in the levels in order of appearance.
df$id_new2 <- as.integer(factor(df$name, levels=unique(df$name)))
df
# name id_num id_new id_new2
# 1 george 1 2 1
# 2 george 1 2 1
# 3 george 2 2 1
# 4 sara 3 4 2
# 5 sara 3 4 2
# 6 sam 4 3 3
# 7 bill 5 1 4
# 8 bill 5 1 4
Note: R >= 4.1 used.
Data:
df <- structure(list(name = c("george", "george", "george", "sara",
"sara", "sam", "bill", "bill"), id_num = c(1, 1, 2, 3, 3, 4,
5, 5)), class = "data.frame", row.names = c(NA, -8L))