Formula for finding 'near perfect' matches [duplicate]
I have a list of +-8000 items which is the result after applying numerous techniques to remove duplicates (the list started at over 10000 items).
I now have the following issue -
Example 1. Jack Daniel 2. Jack Daniels 3. Jack Daniel's
Clearly all of the above relate to one item but they are still technically unique. I tried pulling the first 4 letters and checking for matches however there were +-4000 and most of them were false positives i.e.
- Jack Dxxx
- Jack Bxxx
Both would pull Jack but would not be a valid duplicate.
Any thoughts?
Based on Zoredache's comment, here is my sample workbook using VBA and the Levenshtein Distance to find similar strings within a big list. It's based on @smirkingman and @Apostolos55 answers on stackoverflow.
The Levenshtein distance between two words is the minimum number of single-character edits (insertion, deletion, substitution) required to change one word into the other
I implemented two different versions. Please check which function is faster for your case with 8000 values. If you're curious, view the full VBA code on Github. Raise the threshold in the line const treshold = 1
if you want results with more than 1 required edit to get a match somewhere.
- Formula syntax:
=LevenshteinCompare( <cell_to_check> , <range_to_search_in> )
Example:=LevenshteinCompare(A2;A$2:A$12)
(Notice the fixed range) - Output syntax:
<number_of_required_edits> - [<match_address>] <match_value>
Private Function Levenshtein(S1 As String, S2 As String)
Dim i As Integer, j As Integer
Dim l1 As Integer, l2 As Integer
Dim d() As Integer
Dim min1 As Integer, min2 As Integer
l1 = Len(S1)
l2 = Len(S2)
ReDim d(l1, l2)
For i = 0 To l1
d(i, 0) = i
Next
For j = 0 To l2
d(0, j) = j
Next
For i = 1 To l1
For j = 1 To l2
If Mid(S1, i, 1) = Mid(S2, j, 1) Then
d(i, j) = d(i - 1, j - 1)
Else
min1 = d(i - 1, j) + 1
min2 = d(i, j - 1) + 1
If min2 < min1 Then
min1 = min2
End If
min2 = d(i - 1, j - 1) + 1
If min2 < min1 Then
min1 = min2
End If
d(i, j) = min1
End If
Next
Next
Levenshtein = d(l1, l2)
End Function
Public Function LevenshteinCompare(S1 As Range, wordrange As Range)
Const treshold = 1
For Each S2 In Application.Intersect(wordrange, wordrange.Parent.UsedRange)
oldRes = newRes
newRes = Levenshtein(S1.Value, S2.Value)
If oldRes < newRes And oldRes <> "" Or S1.Address = S2.Address Then
newRes = oldRes
newS2row = oldS2row
Else
oldS2 = S2
oldS2row = S2.Address(0, 0)
End If
newS2 = oldS2
Next
If newRes <= treshold Then
LevenshteinCompare = newRes & " - [" & newS2row & "] " & newS2
Else
LevenshteinCompare = ""
End If
End Function
That was fun ☜(゚ヮ゚☜)