Formula for finding 'near perfect' matches [duplicate]

I have a list of +-8000 items which is the result after applying numerous techniques to remove duplicates (the list started at over 10000 items).

I now have the following issue -

Example 1. Jack Daniel 2. Jack Daniels 3. Jack Daniel's

Clearly all of the above relate to one item but they are still technically unique. I tried pulling the first 4 letters and checking for matches however there were +-4000 and most of them were false positives i.e.

  1. Jack Dxxx
  2. Jack Bxxx

Both would pull Jack but would not be a valid duplicate.

Any thoughts?


Based on Zoredache's comment, here is my sample workbook using VBA and the Levenshtein Distance to find similar strings within a big list. It's based on @smirkingman and @Apostolos55 answers on stackoverflow.

The Levenshtein distance between two words is the minimum number of single-character edits (insertion, deletion, substitution) required to change one word into the other

I implemented two different versions. Please check which function is faster for your case with 8000 values. If you're curious, view the full VBA code on Github. Raise the threshold in the line const treshold = 1 if you want results with more than 1 required edit to get a match somewhere.

enter image description here

  • Formula syntax: =LevenshteinCompare( <cell_to_check> , <range_to_search_in> )
    Example: =LevenshteinCompare(A2;A$2:A$12) (Notice the fixed range)
  • Output syntax: <number_of_required_edits> - [<match_address>] <match_value>

Private Function Levenshtein(S1 As String, S2 As String)

Dim i As Integer, j As Integer
Dim l1 As Integer, l2 As Integer
Dim d() As Integer
Dim min1 As Integer, min2 As Integer

l1 = Len(S1)
l2 = Len(S2)
ReDim d(l1, l2)
For i = 0 To l1
    d(i, 0) = i
Next
For j = 0 To l2
    d(0, j) = j
Next
For i = 1 To l1
    For j = 1 To l2
        If Mid(S1, i, 1) = Mid(S2, j, 1) Then
            d(i, j) = d(i - 1, j - 1)
        Else
            min1 = d(i - 1, j) + 1
            min2 = d(i, j - 1) + 1
            If min2 < min1 Then
                min1 = min2
            End If
            min2 = d(i - 1, j - 1) + 1
            If min2 < min1 Then
                min1 = min2
            End If
            d(i, j) = min1
        End If
    Next
Next
Levenshtein = d(l1, l2)
End Function


Public Function LevenshteinCompare(S1 As Range, wordrange As Range)

Const treshold = 1
For Each S2 In Application.Intersect(wordrange, wordrange.Parent.UsedRange)
    oldRes = newRes
    newRes = Levenshtein(S1.Value, S2.Value)
    If oldRes < newRes And oldRes <> "" Or S1.Address = S2.Address Then
        newRes = oldRes
        newS2row = oldS2row
    Else
        oldS2 = S2
        oldS2row = S2.Address(0, 0)
    End If
    newS2 = oldS2
Next

If newRes <= treshold Then
    LevenshteinCompare = newRes & " - [" & newS2row & "]  " & newS2
Else
    LevenshteinCompare = ""
End If
End Function

That was fun ☜(゚ヮ゚☜)