How do I add an .x and .y suffix to all duplicate variables after the join?
I was using dplyr::left_join() initially, but it didn't have the match = "first" argument that I wanted that was available in plyr::join. The problem now is that my remaining code has .x and .y for duplicate variables so I can't just use janitor::clean_names().
To give you a sense of what I'm talking about:
id <- c(1, 2, 3, 4, 5)
name1 <- c("a", "b", "c", "d", "e")
name2 <- c("k", "l", "m", "n", "o")
name3 <- c("u", "v", "w", "x", "y")
name4 <- c("z", "a", "b", "c", "d")
df <- data.frame(id, name1, name2, name3, name4)
id <- c(1, 2, 3, 4, 5)
name1 <- c("f", "i", "j", "k", "l")
name2 <- c("p", "q", "r", "s", "t")
name3 <- c("z", "a", "b", "c", "d")
name5 <- c("z", "a", "b", "c", "d")
df2 <- data.frame(id, name1, name2, name3, name5)
library(plyr)
df_join <- plyr::join(df, df2, by=c("id"), type="left", match="first")
I want to turn the duplicate variables (name1, name2, name3) into the following:
Right side: name1.x, name2.x, name3.x Left side: name1.y, name2.y, name3.y
How would I go about doing this? janitor::clean_names() seems to only add "_2" for the left hand side variables.
Solution 1:
You can still use dplyr::left_join
. To mimic the behavior of plyr::join
, just turn duplicated id
s in your right dataframe into NA
s or any other values that will not be found in the id
column of your left dataframe. For instance,
library(dplyr)
df2 %>%
mutate(id = replace(id, duplicated(id), NA_integer_)) %>%
left_join(df, ., by = "id", suffix = c(".y", ".x"))
Output
id name1.y name2.y name3.y name4 name1.x name2.x name3.x name5
1 1 a k u z f p z z
2 2 b l v a i q a a
3 3 c m w b j r b b
4 4 d n x c k s c c
5 5 e o y d l t d d
The two dataframes are
> df
id name1 name2 name3 name4
1 1 a k u z
2 2 b l v a
3 3 c m w b
4 4 d n x c
5 5 e o y d
> df2
id name1 name2 name3 name5
1 1 f p z z
2 2 i q a a
3 3 j r b b
4 4 k s c c
5 5 l t d d
6 1 XXX XXX XXX XXX
This one should work for any number of columns you want to join by (as long as you specify them).
libaray(dplyr)
by <- c("id1", "id2")
df2 %>% # v-v------------------- Not a typo. You need two commas here.
mutate(across(!!by) %>% `[<-`(duplicated(.), , value = NA)) %>%
left_join(df, ., by = by, suffix = c(".y", ".x"))
The two dataframes
> df
id1 id2 name1 name2 name3 name4
1 1 5 a k u z
2 2 4 b l v a
3 3 3 c m w b
4 4 2 d n x c
5 5 1 e o y d
> df2
id1 id2 name1 name2 name3 name5
1 1 5 f p z z
2 2 4 i q a a
3 3 3 j r b b
4 4 2 k s c c
5 5 1 l t d d
6 3 3 XXX XXX XXX XXX
Output
id1 id2 name1.y name2.y name3.y name4 name1.x name2.x name3.x name5
1 1 5 a k u z f p z z
2 2 4 b l v a i q a a
3 3 3 c m w b j r b b
4 4 2 d n x c k s c c
5 5 1 e o y d l t d d