Merging two Data Frames using Fuzzy/Approximate String Matching in R
Solution 1:
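I'd highly recommend the fuzzyjoin package (dgrtwo/fuzzyjoin on GitHub). Its stringdist_inner_join function joins rows whose Fund.Name values are within a small string distance of each other, rather than requiring an exact match:
library(fuzzyjoin)
stringdist_inner_join(a, b, by = "Fund.Name")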
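As a rough sketch of how that call behaves, here it is on two tiny made-up data frames. The fund names, the Assets and Manager columns, and the column name "dist" are invented for illustration; max_dist, method and distance_col are arguments that stringdist_inner_join passes through to stringdist_join. This assumes fuzzyjoin (and its stringdist dependency) is installed.

```r
library(fuzzyjoin)

# Hypothetical toy data: names differ by a single dropped letter
a <- data.frame(Fund.Name = c("Vanguard Growth Fund", "Fidelity Income Fund"),
                Assets = c(100, 200),
                stringsAsFactors = FALSE)
b <- data.frame(Fund.Name = c("Vanguard Growth Fnd", "Fidelity Incom Fund",
                              "Unrelated Holdings"),
                Manager = c("A", "B", "C"),
                stringsAsFactors = FALSE)

# Keep pairs whose names are at most 2 edits apart (OSA distance),
# and record the distance of each matched pair in a "dist" column
joined <- stringdist_inner_join(a, b, by = "Fund.Name",
                                max_dist = 2, method = "osa",
                                distance_col = "dist")
print(joined)
```

Because both inputs use the same key name, the result carries the two versions as Fund.Name.x and Fund.Name.y, which makes it easy to eyeball the matches before trusting them. Tightening max_dist reduces false matches at the cost of missing messier names.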
Solution 2:
One quick suggestion: try matching on the different fields separately before using merge. The simplest approach is the pmatch function, although R has no shortage of text-matching functions (e.g. agrep). Here's a simple example:
pmatch(c("med", "mod"), c("mean", "median", "mode"))
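To make the difference between the two functions mentioned above a bit more concrete, here is a small sketch on made-up strings (the misspelling "medien" is purely illustrative). pmatch only completes unique prefixes, while agrep tolerates small typos anywhere in the string.

```r
choices <- c("mean", "median", "mode")

# pmatch returns the index of the one element each string is a prefix of
prefix_idx <- pmatch(c("med", "mod"), choices)   # 2 3

# An ambiguous prefix ("me" fits both "mean" and "median") gives NA
ambiguous_idx <- pmatch("me", choices)           # NA

# agrep does approximate matching; with the default max.distance
# a six-letter pattern is allowed one edit
typo_idx <- agrep("median", c("mean", "medien", "mode"))   # 2
```

So pmatch is a good fit when one name is a truncated version of the other, and agrep is the one to reach for when the names contain actual spelling differences.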
For your dataset, pmatch matches all of the fund names in a:
> nrow(merge(a,b,x.by="Fund.Name", y.by="Fund.name"))
[1] 58
> length(which(!is.na(pmatch(a$Fund.Name, b$Fund.name))))
[1] 238
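Once you have the matches, you can merge the two data frames on the matched indices instead of on the raw names. Note that merge spells its key arguments by.x and by.y; x.by and y.by as typed above are not recognized, so that call falls back to merge's default of joining on the column names the two data frames share.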
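One way to do that merge, sketched on hypothetical toy data (the fund names, the Assets and Manager columns, and the helper column match_id are all invented for illustration), is to store the pmatch result as a key and merge on it:

```r
# Hypothetical toy data: names in a are truncated versions of names in b
a <- data.frame(Fund.Name = c("Vanguard Growth", "Fidelity Income", "No Such Fund"),
                Assets = c(100, 200, 300),
                stringsAsFactors = FALSE)
b <- data.frame(Fund.name = c("Fidelity Income Fund", "Vanguard Growth Fund"),
                Manager = c("B", "A"),
                stringsAsFactors = FALSE)

# For each row of a, the row of b it partially matches (NA if none)
a$match_id <- pmatch(a$Fund.Name, b$Fund.name)
# Give each row of b its own row number as the key
b$match_id <- seq_len(nrow(b))

# Inner join on the key; rows of a without a match are dropped
merged <- merge(a, b, by = "match_id")
print(merged)
```

Keeping both Fund.Name and Fund.name in the result lets you check the pairs by eye. Adding all.x = TRUE to the merge call would keep the unmatched rows of a as well, with NA in the columns from b.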