How do I add an .x and .y suffix to all duplicate variables after the join?

I was using dplyr::left_join() initially, but it didn't have the match = "first" argument that I wanted that was available in plyr::join. The problem now is that my remaining code has .x and .y for duplicate variables so I can't just use janitor::clean_names().

To give you a sense of what I'm talking about:

id <- c(1, 2, 3, 4, 5)
name1 <- c("a", "b", "c", "d", "e")
name2 <- c("k", "l", "m", "n", "o")
name3 <- c("u", "v", "w", "x", "y")
name4 <- c("z", "a", "b", "c", "d")

df <- data.frame(id, name1, name2, name3, name4)

id <- c(1, 2, 3, 4, 5)
name1 <- c("f", "i", "j", "k", "l")
name2 <- c("p", "q", "r", "s", "t")
name3 <- c("z", "a", "b", "c", "d")
name5 <- c("z", "a", "b", "c", "d")
df2 <- data.frame(id, name1, name2, name3, name5)

library(plyr)

df_join <- plyr::join(df, df2, by=c("id"), type="left", match="first")

I want to turn the duplicate variables (name1, name2, name3) into the following:

Right side: name1.x, name2.x, name3.x Left side: name1.y, name2.y, name3.y

How would I go about doing this? janitor::clean_names() seems to only add "_2" for the left hand side variables.


Solution 1:

You can still use dplyr::left_join. To mimic the behavior of plyr::join, just turn duplicated ids in your right dataframe into NAs or any other values that will not be found in the id column of your left dataframe. For instance,

library(dplyr)

df2 %>% 
  mutate(id = replace(id, duplicated(id), NA_integer_)) %>% 
  left_join(df, ., by = "id", suffix = c(".y", ".x"))

Output

  id name1.y name2.y name3.y name4 name1.x name2.x name3.x name5
1  1       a       k       u     z       f       p       z     z
2  2       b       l       v     a       i       q       a     a
3  3       c       m       w     b       j       r       b     b
4  4       d       n       x     c       k       s       c     c
5  5       e       o       y     d       l       t       d     d

The two dataframes are

> df
  id name1 name2 name3 name4
1  1     a     k     u     z
2  2     b     l     v     a
3  3     c     m     w     b
4  4     d     n     x     c
5  5     e     o     y     d
> df2
  id name1 name2 name3 name5
1  1     f     p     z     z
2  2     i     q     a     a
3  3     j     r     b     b
4  4     k     s     c     c
5  5     l     t     d     d
6  1   XXX   XXX   XXX   XXX

This one should work for any number of columns you want to join by (as long as you specify them).

libaray(dplyr)

by <- c("id1", "id2")
df2 %>%                                    # v-v------------------- Not a typo. You need two commas here.
  mutate(across(!!by) %>% `[<-`(duplicated(.), , value = NA)) %>% 
  left_join(df, ., by = by, suffix = c(".y", ".x"))

The two dataframes

> df
  id1 id2 name1 name2 name3 name4
1   1   5     a     k     u     z
2   2   4     b     l     v     a
3   3   3     c     m     w     b
4   4   2     d     n     x     c
5   5   1     e     o     y     d
> df2
  id1 id2 name1 name2 name3 name5
1   1   5     f     p     z     z
2   2   4     i     q     a     a
3   3   3     j     r     b     b
4   4   2     k     s     c     c
5   5   1     l     t     d     d
6   3   3   XXX   XXX   XXX   XXX

Output

  id1 id2 name1.y name2.y name3.y name4 name1.x name2.x name3.x name5
1   1   5       a       k       u     z       f       p       z     z
2   2   4       b       l       v     a       i       q       a     a
3   3   3       c       m       w     b       j       r       b     b
4   4   2       d       n       x     c       k       s       c     c
5   5   1       e       o       y     d       l       t       d     d