How do I conditionally combine some rows (but not others) in R data frames?
I'm trying to figure out how to merge/join two data frames in such a way that if a certain condition is met, R combines two rows from the two data frames into one row, but if the condition is not met, R adds a new row with NAs in the columns that are not present in the original data frame. I'm unclear if this is a simpler join than I think it is, but I've been unable to figure out how to do this, even after reading through some stack overflow results (e.g. or e.g.).
Below are two example data frames:
df1 <- data.frame(Name_df1 = c("Alan", "Steve", "Melanie", "Steve", "Melanie"),
Date_df1 = c("02/18/2008", "02/18/2008", "03/14/2009", "04/19/2009", "03/16/2009"),
Job_df1 = c("Cook", "Security", "Greeter", "Security", "Greeter"),
Hours_df1 = c(8.5, 7.0, 6.0, 7.0, 6.0))
df2 <- data.frame(Name_df2 = c("Steve", "Alan", "Melanie", "Melanie", "Steve", "Carter"),
Date_df2 = c("02/18/2008", "02/18/2008", "03/14/2009", "03/17/2009","04/25/2009", "08/15/2011"),
Job_df2 = c("Police", "Chef", "Greeter", "Greeter", "Security", "Doorman"),
Hours_df2 = c(8.5, 7.0, 6.0, 7.0, 7.0, 6.5),
Wage_df2 = c(80, 77, 127.5, 90, 145, 100))
df1 and df2 both describe dates on which certain individuals worked, their job titles, names, and hours worked on the specified date, and only df2 also contains wage information. What I want to do is join df1 and df2 into a df3 with the following columns:
- "Name_df1"
- "Name_df2"
- "Date_df1"
- "Date_df2"
- "Job_df1"
- "Job_df2"
- "Hours_df1"
- "Hours_df2"
- "Wage_df2"
Specifically, I want to join df1 and df2 such that for every row in df1 or df2, if there is an exact match for both the value in the "Name_df1"/"Name_df2" column and the value in the "Date_df1"/"Date_df2" column in the corresponding column in the opposite data frame (i.e. in df1, if the value in the "Date_df1" column and "Name_df1" columns each have an exact match in the "Date_df2" column and "Name_df2" column, respectively, of df2):
- A row is created in df3 that has the "Name_df1", "Name_df2", "Date_df1", "Date_df2", "Job_df1", "Job_df2", "Hours_df1", "Hours_df2", and "Wage" columns filled in with the relevant values from df1 and df2. The values in the "Job_df1", "Job_df2", "Hours_df1", and "Hours_df2", may or may not be the same.
However, if there is not an exact match for both the value in the "Name_df1"/"Name_df2" column and the value in the "Date_df1"/"Date_df2" column, then df3 should get a row that includes only the information from df1 in "Name_df1", "Date_df1", "Job_df1", and "Hours_df1", and NAs in "Name_df2", "Date_df2", "Job_df2", "Hours_df2", and "Wage_df2". Or, if the original row is from df2, then df3 should get a row that includes only the information from df2 in "Name_df2", "Date_df2", "Job_df2", "Hours_df2", and "Wage_df2", and NAs in "Name_df1", "Date_df1", "Job_df1", and "Hours_df1".
I know that's a mouthful, but based on the df1 and df2 specified above, this is what df3 would look like:
df3 <- data.frame(Name_df1 = c("Alan", "Steve", "Melanie", "Steve", "Melanie", NA, NA, NA),
Name_df2 = c("Alan", "Steve", "Melanie", NA, NA, "Melanie", "Steve", "Carter"),
Date_df1 = c("02/18/2008", "02/18/2008", "03/14/2009", "04/19/2009", "03/16/2009", NA, NA, NA),
Date_df2 = c("02/18/2008", "02/18/2008", "03/14/2009", NA, NA, "03/17/2009", "04/25/2009", "08/15/2011"),
Job_df1 = c("Cook", "Security", "Greeter", "Security", "Greeter", NA, NA, NA),
Job_df2 = c("Chef", "Police", "Greeter", NA, NA, "Greeter", "Security", "Doorman"),
Hours_df1 = c(7.0, 7.0, 6.0, 7.0, 6.0, NA, NA, NA),
Hours_df2 = c(7.0, 8.5, 6.0, NA, NA, 7.0, 7.0, 6.5),
Wage_df2 = c(77.0, 80, 127.5, NA, NA, 90.0, 145.0, 100))
Could anyone help me with this?
It seems you are looking for full_join
but to get the exact desired output (df3
) you need to perform some manipulation.
library(dplyr)
full_join(df1, df2, by = c('Name_df1' = 'Name_df2', 'Date_df1' = 'Date_df2')) %>%
mutate(Name_df2 = ifelse(is.na(Job_df2), NA, Name_df1),
Name_df1 = ifelse(is.na(Job_df1), NA, Name_df1),
Date_df2 = ifelse(is.na(Job_df2), NA, Date_df1),
Date_df1 = ifelse(is.na(Job_df1), NA, Date_df1)) %>%
select(starts_with('Name'), starts_with('Date'), starts_with('Job'), everything())
# Name_df1 Name_df2 Date_df1 Date_df2 Job_df1 Job_df2 Hours_df1 Hours_df2 Wage_df2
#1 Alan Alan 02/18/2008 02/18/2008 Cook Chef 8.5 7.0 77.0
#2 Steve Steve 02/18/2008 02/18/2008 Security Police 7.0 8.5 80.0
#3 Melanie Melanie 03/14/2009 03/14/2009 Greeter Greeter 6.0 6.0 127.5
#4 Steve <NA> 04/19/2009 <NA> Security <NA> 7.0 NA NA
#5 Melanie <NA> 03/16/2009 <NA> Greeter <NA> 6.0 NA NA
#6 <NA> Melanie <NA> 03/17/2009 <NA> Greeter NA 7.0 90.0
#7 <NA> Steve <NA> 04/25/2009 <NA> Security NA 7.0 145.0
#8 <NA> Carter <NA> 08/15/2011 <NA> Doorman NA 6.5 100.0
Maybe this could be a way to join those two data frames:
library(dplyr) df3 <- df1 %>% full_join(df2, by="Name", suffix= c(".df1", ".df2"))
Result:
Table