how to avoid overwriting when merging multiple datasets in r
Suppose I have two datasets df1 and df2 as follows:
df1 <- data.frame(Id = c(1L,2L,3L,4L,5L,6L,7L,8L), pricetag = c("na","na","na","na","na","na","na","na"),stringsAsFactors=F)
df2 <- data.frame(Id=c(1L,2L,3L,4L), price = c(10,20,30,40), stringsAsFactors=F)
> df1
Id pricetag
1 1 na
2 2 na
3 3 na
4 4 na
5 5 na
6 6 na
7 7 na
8 8 na
> df2
Id price
1 1 10
2 2 20
3 3 30
4 4 40
I am trying to insert price values from df2 to df1 by matching the id using this function.
df1$pricetag <- df2$price[match(df1$Id, df2$Id)]
which provides this:
> df1
Id pricetag
1 1 10
2 2 20
3 3 30
4 4 40
5 5 NA
6 6 NA
7 7 NA
8 8 NA
I have the third dataset. I am trying to follow the same procedure.
df3 <- data.frame(Id=c(5L,6L,7L,8L), price=c(50,60,70,80),stringsAsFactors=F)
> df3
Id price
1 5 50
2 6 60
3 7 70
4 8 80
df1$pricetag <- df3$price[match(df1$Id, df3$Id)]
> df1
Id pricetag
1 1 NA
2 2 NA
3 3 NA
4 4 NA
5 5 50
6 6 60
7 7 70
8 8 80
However, it overwrites the price information coming from df2 in the df1. Is there any way to turn this option off when I replicate the same procedure?
Solution 1:
Replace
df1$pricetag <- df3$price[match(df1$Id, df3$Id)]
in case you want to make an update-join (overwrite df1 with data in df3) with:
idx <- match(df1$Id, df3$Id)
idxn <- which(!is.na(idx))
df1$pricetag[idxn] <- df3$price[idx[idxn]]
rm(idx, idxn)
df1
# Id pricetag
#1 1 10
#2 2 20
#3 3 30
#4 4 40
#5 5 50
#6 6 60
#7 7 70
#8 8 80
in case you want to make a gap-fill-join (fill NA's in df1 with data in df3) with:
idxg <- which(is.na(df1$pricetag))
idx <- match(df1$Id[idxg], df3$Id)
idxn <- which(!is.na(idx))
df1$pricetag[idxg][idxn] <- df3$price[idx[idxn]]
rm(idxg, idx, idxn)
df1
# Id pricetag
#1 1 10
#2 2 20
#3 3 30
#4 4 40
#5 5 50
#6 6 60
#7 7 70
#8 8 80
Solution 2:
You can use the is.na
function to identify rows to look up:
w = which(is.na(df1$pricetag))
df1$pricetag[w] <- df3$price[match(df1$Id[w], df3$Id)]
Id category pricetag
1 1 na 10
2 2 na 20
3 3 na 30
4 4 na 40
5 5 na 50
6 6 na 60
7 7 na 70
8 8 na 80
There's some more convenient syntax for this with the data.table package:
df1 <- data.frame(Id=c(1L,2L,3L,4L,5L,6L,7L,8L), category="na", stringsAsFactors=FALSE)
library(data.table)
setDT(df1); setDT(df2); setDT(df3)
df1[, pricetag := NA_real_]
for (odf in list(df2, df3))
df1[is.na(pricetag),
pricetag := odf[.SD, on=.(Id), x.price]
][]
Id category pricetag
1: 1 na 10
2: 2 na 20
3: 3 na 30
4: 4 na 40
5: 5 na 50
6: 6 na 60
7: 7 na 70
8: 8 na 80
This kind of merge is called an "update join".