how to avoid overwriting when merging multiple datasets in r

Suppose I have two datasets df1 and df2 as follows:

df1 <- data.frame(Id = c(1L,2L,3L,4L,5L,6L,7L,8L), pricetag = c("na","na","na","na","na","na","na","na"),stringsAsFactors=F)
df2 <- data.frame(Id=c(1L,2L,3L,4L), price = c(10,20,30,40), stringsAsFactors=F)

> df1
  Id pricetag
1  1       na
2  2       na
3  3       na
4  4       na
5  5       na
6  6       na
7  7       na
8  8       na
> df2
  Id price
1  1    10
2  2    20
3  3    30
4  4    40

I am trying to insert price values from df2 to df1 by matching the id using this function.

df1$pricetag <- df2$price[match(df1$Id, df2$Id)]

which provides this:

> df1
  Id pricetag
1  1       10
2  2       20
3  3       30
4  4       40
5  5       NA
6  6       NA
7  7       NA
8  8       NA

I have the third dataset. I am trying to follow the same procedure.

df3 <- data.frame(Id=c(5L,6L,7L,8L), price=c(50,60,70,80),stringsAsFactors=F)
> df3
  Id price
1  5    50
2  6    60
3  7    70
4  8    80

df1$pricetag <- df3$price[match(df1$Id, df3$Id)]

> df1
  Id pricetag
1  1       NA
2  2       NA
3  3       NA
4  4       NA
5  5       50
6  6       60
7  7       70
8  8       80

However, it overwrites the price information coming from df2 in the df1. Is there any way to turn this option off when I replicate the same procedure?

Solution 1:

Replace

df1$pricetag <- df3$price[match(df1$Id, df3$Id)]

in case you want to make an update-join (overwrite df1 with data in df3) with:

idx <- match(df1$Id, df3$Id)
idxn <- which(!is.na(idx))
df1$pricetag[idxn] <- df3$price[idx[idxn]]
rm(idx, idxn)
df1
#  Id pricetag
#1  1       10
#2  2       20
#3  3       30
#4  4       40
#5  5       50
#6  6       60
#7  7       70
#8  8       80

in case you want to make a gap-fill-join (fill NA's in df1 with data in df3) with:

idxg <- which(is.na(df1$pricetag))
idx <- match(df1$Id[idxg], df3$Id)
idxn <- which(!is.na(idx))
df1$pricetag[idxg][idxn] <- df3$price[idx[idxn]]
rm(idxg, idx, idxn)
df1
#  Id pricetag
#1  1       10
#2  2       20
#3  3       30
#4  4       40
#5  5       50
#6  6       60
#7  7       70
#8  8       80

Solution 2:

You can use the is.na function to identify rows to look up:

w = which(is.na(df1$pricetag))
df1$pricetag[w] <- df3$price[match(df1$Id[w], df3$Id)]

  Id category pricetag
1  1       na       10
2  2       na       20
3  3       na       30
4  4       na       40
5  5       na       50
6  6       na       60
7  7       na       70
8  8       na       80

There's some more convenient syntax for this with the data.table package:

df1 <- data.frame(Id=c(1L,2L,3L,4L,5L,6L,7L,8L), category="na", stringsAsFactors=FALSE)

library(data.table)
setDT(df1); setDT(df2); setDT(df3)

df1[, pricetag := NA_real_]
for (odf in list(df2, df3)) 
  df1[is.na(pricetag), 
    pricetag := odf[.SD, on=.(Id), x.price]
][]

   Id category pricetag
1:  1       na       10
2:  2       na       20
3:  3       na       30
4:  4       na       40
5:  5       na       50
6:  6       na       60
7:  7       na       70
8:  8       na       80

This kind of merge is called an "update join".

how to avoid overwriting when merging multiple datasets in r

Solution 1:

Solution 2:

Related

Recent Posts