Condtional match columns across different dataframes
Solution 1:
Here are two solutions in the tidyverse
. While the first works for this particular case, I favor the second, which is more elegant and extensible.
Solution 1: A JOIN
for Each label*
First import the tidyverse
and generate your datasets original
and index
.
library(tidyverse)
# ...
# Code to generate 'original' and 'index' datasets.
# ...
Then apply this workflow.
original %>%
# Uniquely identify each row (for pivoting later).
mutate(row_id = row_number()) %>%
# Match 'label1' to the tags.
left_join(
index,
by = c("label1" = "item"),
keep = TRUE
) %>%
# Match 'label2' to the tags.
left_join(
index,
by = c("label2" = "item"),
keep = TRUE,
suffix = c(".1", ".2")
) %>%
# Pivot 'item.1 | ... | item.n | code.1 | ... | code.n' into a consolidated
# 'item | code' form.
pivot_longer(
cols = matches("^(item|code)\\.(\\d+)?$"),
names_pattern = "^(item|code)\\.(\\d+)?$",
names_to = c(".value", NA)
) %>%
# Pivot back into a 'tag1 | tag0' form.
pivot_wider(
values_from = item,
names_from = code,
names_glue = "tag{code}"
) %>%
# Omit unique identifier.
select(!row_id)
Result
Given the original
and index
datasets like those reproduced here
original <- data.frame(
label1 = c("cat", "cat", "dog", "dog", "cat", "tiger", "tiger", "cow"),
label2 = c("dog", "dog", "cat", "cat", "dog", "cow", "cow", "tiger")
)
index <- data.frame(
item = c("cat", "dog", "tiger", "cow"),
code = c(1, 0, 1, 0)
)
this solution should yield the following result:
# A tibble: 8 x 4
label1 label2 tag1 tag0
<chr> <chr> <chr> <chr>
1 cat dog cat dog
2 cat dog cat dog
3 dog cat cat dog
4 dog cat cat dog
5 cat dog cat dog
6 tiger cow tiger cow
7 tiger cow tiger cow
8 cow tiger tiger cow
Note
If your original
dataset has any further label*
columns, you'll need to perform an extra JOIN
for each and every one of those columns.
Solution 2: A Single CROSS JOIN
Here's a more elegant workflow, which is also more flexible: it works for an arbitrary number of label*
columns in original
and an arbitrary set of code
s in index
.
original %>%
# Uniquely identify each row (for pivoting later).
mutate(row_id = row_number()) %>%
# Perform a cross-join compare every 'item' to every 'label*'.
full_join(
index,
by = character()
) %>%
# Keep only those rows where 'item' matches a 'label*'.
rowwise() %>%
filter(item %in% c_across(matches("^label\\d+"))) %>%
# Pivot into a 'tag1 | tag0' form.
pivot_wider(
values_from = item,
names_from = code,
names_glue = "tag{code}"
) %>%
# Omit unique identifier.
select(!row_id)
Result
The results remain identical.
# A tibble: 8 x 4
label1 label2 tag1 tag0
<chr> <chr> <chr> <chr>
1 cat dog cat dog
2 cat dog cat dog
3 dog cat cat dog
4 dog cat cat dog
5 cat dog cat dog
6 tiger cow tiger cow
7 tiger cow tiger cow
8 cow tiger tiger cow
Note
The only drawback is that it must perform a CROSS JOIN
, which could hinder performance on larger datasets.
Solution 2:
Another possible solution:
library(tidyverse)
original <- data.frame(label1 = c("cat", "cat", "dog", "dog", "cat", "tiger", "tiger", "cow"),
label2 = c("dog", "dog", "cat", "cat", "dog", "cow", "cow", "tiger"))
index <- data.frame(item = c("cat", "dog", "tiger", "cow"),
code = c(1, 0, 1, 0))
original %>%
full_join(index, by=c("label1" = "item")) %>%
full_join(index, by=c("label2" = "item")) %>%
mutate(tag1 = if_else(code.x == 1, label1, label2)) %>%
mutate(tag2 = if_else(code.y == 1, label1, label2)) %>%
select(!starts_with("code"))
#> label1 label2 tag1 tag2
#> 1 cat dog cat dog
#> 2 cat dog cat dog
#> 3 dog cat cat dog
#> 4 dog cat cat dog
#> 5 cat dog cat dog
#> 6 tiger cow tiger cow
#> 7 tiger cow tiger cow
#> 8 cow tiger tiger cow