Merge R data frame or data table and overwrite values of multiple columns
I'd probably put the data in long form and drop dupes:
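library(data.table)
k = key(dt_1)  # assumes dt_1 is keyed, e.g. setkey(dt_1, id, date)
DTList = list(dt_1, dt_2)
# melt each table to long form (one row per key and variable), then stack them
DTLong = rbindlist(lapply(DTList, function(x) melt(x, id.vars = k)))
# sort by all columns, pushing NA values to the end of each group
setorder(DTLong, na.last = TRUE)
# keep the first row per key and variable, which is non-NA whenever one exists
unique(DTLong, by = c(k, "variable"))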
    id       date variable value
1: abc 2018-01-01        a     3
2: abc 2018-01-01        b     5
3: abc 2018-01-01        c     4
4: abc 2018-01-01        d     6
5: abc 2018-01-01        e    NA
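If the result is needed back in wide form, the long table can be reshaped with dcast. The following is only a sketch: dt_1 and dt_2 here are hypothetical stand-ins (the question's real tables are not shown), and the key columns id and date are assumed from the output above.

```r
library(data.table)

# Hypothetical inputs, not the question's actual data.
dt_1 <- data.table(id = "abc", date = as.Date("2018-01-01"),
                   a = 3, b = NA_real_, c = 4, d = NA_real_, e = NA_real_)
dt_2 <- data.table(id = "abc", date = as.Date("2018-01-01"),
                   b = 5, d = 6, e = NA_real_)
setkey(dt_1, id, date)

k <- key(dt_1)
# Long form: one row per key and variable, stacked across both tables.
DTLong <- rbindlist(lapply(list(dt_1, dt_2), function(x) melt(x, id.vars = k)))
# Sort by all columns with NA last, then keep the first row per key and variable.
setorder(DTLong, na.last = TRUE)
DTRes <- unique(DTLong, by = c(k, "variable"))

# Back to wide form: one column per variable.
DTWide <- dcast(DTRes, id + date ~ variable, value.var = "value")
print(DTWide)
```

Note that because the sort covers the value column too, this keeps the smallest non-NA value when both tables supply one, rather than always preferring one table over the other.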
You can do this with dplyr::coalesce, which returns the first non-missing value at each position of the vectors you pass it.
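As a quick illustration of that behaviour, here is a minimal sketch on two made-up vectors (x and y are hypothetical, standing in for the same column taken from each table):

```r
library(dplyr)

# Hypothetical vectors standing in for one column from each table.
x <- c(3, NA, 4, NA)
y <- c(9, 5, NA, NA)

# Position by position: take x unless it is NA, in which case fall back to y.
# Where both are NA, the result stays NA.
result <- coalesce(x, y)
print(result)
```

Values in x always win when present, so the order of the arguments decides which table takes priority.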
(EDIT: you can also use dplyr::coalesce directly on the data frames, so there is no need to create the function below. I've left it there for completeness, as a record of the original answer.)
Credit where it's due: this code is mostly from this blog post. It builds a function that takes two data frames and does what you need (taking values from the x data frame if they are present).
coalesce_join <- function(x,
                          y,
                          by,
                          suffix = c(".x", ".y"),
                          join = dplyr::full_join, ...) {
  joined <- join(x, y, by = by, suffix = suffix, ...)
  # names of desired output
  cols <- union(names(x), names(y))
  to_coalesce <- names(joined)[!names(joined) %in% cols]
  suffix_used <- suffix[ifelse(endsWith(to_coalesce, suffix[1]), 1, 2)]
  # remove suffixes and deduplicate
  to_coalesce <- unique(substr(
    to_coalesce,
    1,
    nchar(to_coalesce) - nchar(suffix_used)
  ))
  coalesced <- purrr::map_dfc(to_coalesce, ~dplyr::coalesce(
    joined[[paste0(.x, suffix[1])]],
    joined[[paste0(.x, suffix[2])]]
  ))
  names(coalesced) <- to_coalesce
  dplyr::bind_cols(joined, coalesced)[cols]
}
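To see what the function automates, it may help to do the same thing by hand for a single overlapping column. This is just a sketch with hypothetical two-column inputs (x and y below are made up, not the question's data): join, coalesce the suffixed pair, then drop the suffixed columns.

```r
library(dplyr)

# Hypothetical inputs: both tables carry a column b, and x is missing its value.
x <- tibble(id = "abc", a = 3, b = NA_real_)
y <- tibble(id = "abc", b = 5)

out <- full_join(x, y, by = "id", suffix = c(".x", ".y")) %>%
  # b.x comes from x and b.y from y; prefer x, fall back to y.
  mutate(b = coalesce(b.x, b.y)) %>%
  # Drop the suffixed helper columns, leaving a single b.
  select(-b.x, -b.y)

print(out)
```

The function above repeats this for every column that appears in both inputs, working out the pairs from the suffixes that the join adds.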
We can use {powerjoin} to do a left join and resolve the conflicting columns with coalesce_xy() (which is pretty much dplyr::coalesce()).
library(powerjoin)
power_left_join(dt_1, dt_2, by = "id", conflict = coalesce_xy)
#    id       date a b c d  e
# 1 abc 2018-01-01 3 5 4 6 NA