String similarity metrics in Python
I want to find the string similarity between two strings. This page has examples of some of them. Python has an implementation of the Levenshtein algorithm. Is there a better algorithm (and hopefully a Python library) under these constraints?
- I want to do fuzzy matches between strings, e.g. matches('Hello, All you people', 'hello, all You peopl') should return True.
- False negatives are acceptable; false positives, except in extremely rare cases, are not.
- This is done in a non-realtime setting, so speed is not (much of) a concern.
- [Edit] I am comparing multi-word strings.
Would something other than Levenshtein distance (or Levenshtein ratio) be a better algorithm for my case?
Solution 1:
I realize it's not the same thing, but this is close enough:
>>> import difflib
>>> a = 'Hello, All you people'
>>> b = 'hello, all You peopl'
>>> seq = difflib.SequenceMatcher(a=a.lower(), b=b.lower())
>>> seq.ratio()
0.97560975609756095
You can wrap this in a function:
def similar(seq1, seq2):
    return difflib.SequenceMatcher(a=seq1.lower(), b=seq2.lower()).ratio() > 0.9
>>> similar(a, b)
True
>>> similar('Hello, world', 'Hi, world')
False
Solution 2:
There's a great resource for string similarity metrics at the University of Sheffield. It lists various metrics (beyond just Levenshtein) and links to open-source implementations of them. Many of them should be easy to adapt into Python; see the sketch after the list for one example.
http://web.archive.org/web/20081224234350/http://www.dcs.shef.ac.uk/~sam/stringmetrics.html
Here's a bit of the list:
- Hamming distance
- Levenshtein distance
- Needleman-Wunsch distance or Sellers algorithm
- and many more...
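As a quick illustration (my own sketch, not code from the Sheffield page), here is plain Levenshtein distance as the classic dynamic program in pure Python:
def levenshtein(s1, s2):
    # Edit distance via dynamic programming: previous/current hold
    # successive rows of the distance table.
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        current = [i]
        for j, c2 in enumerate(s2, 1):
            cost = 0 if c1 == c2 else 1
            current.append(min(previous[j] + 1,        # deletion
                               current[j - 1] + 1,     # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]
>>> levenshtein('ape', 'apple')
2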
Solution 3:
This snippet will calculate the difflib, Levenshtein, Sørensen, and Jaccard similarity values for two strings. In the snippet below, I was iterating over a TSV in which the strings of interest occupied columns [3] and [4] of the TSV (pip install python-Levenshtein and pip install distance):
import codecs, difflib, Levenshtein, distance

with codecs.open("titles.tsv", "r", "utf-8") as f:
    title_list = f.read().split("\n")[:-1]
    for row in title_list:
        # Lowercase the row and pull out the two title columns.
        sr = row.lower().split("\t")
        diffl = difflib.SequenceMatcher(None, sr[3], sr[4]).ratio()
        lev = Levenshtein.ratio(sr[3], sr[4])
        # The distance package returns dissimilarity, so flip to similarity.
        sor = 1 - distance.sorensen(sr[3], sr[4])
        jac = 1 - distance.jaccard(sr[3], sr[4])
        print(diffl, lev, sor, jac)
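For two ad-hoc strings rather than a file, the same four metrics can be computed directly (assuming the same two packages, python-Levenshtein and distance; note the distance package treats the strings as character sequences):
import difflib, Levenshtein, distance

a = 'hello, all you people'
b = 'hello, all you peopl'
print(difflib.SequenceMatcher(None, a, b).ratio())
print(Levenshtein.ratio(a, b))
print(1 - distance.sorensen(a, b))
print(1 - distance.jaccard(a, b))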
Solution 4:
I would use Levenshtein distance, or the so-called Damerau distance (which takes transpositions into account), rather than the difflib stuff, for two reasons: (1) "fast enough" (dynamic programming algorithm) and "whoooosh" (bit-bashing) C code is available, and (2) well-understood behaviour, e.g. Levenshtein satisfies the triangle inequality and thus can be used in e.g. a Burkhard-Keller tree.
Threshold: you should treat as "positive" only those cases where distance < (1 - X) * max(len(string1), len(string2)), and adjust X (the similarity factor) to suit yourself. One way of choosing X is to get a sample of matches, calculate X for each, ignore cases where X < (say) 0.8 or 0.9, then sort the remainder in descending order of X, eyeball them, record the correct result, and calculate some cost-of-mistakes measure for various threshold levels of X.
N.B. Your ape/apple example has distance 2, so X is 0.6; I would only use a threshold as low as 0.75 if I were desperately looking for something and had a high false-negative penalty.
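A minimal sketch of that decision rule, assuming the python-Levenshtein package (the is_match name and the 0.8 default are illustrative choices, not part of the answer):
import Levenshtein

def is_match(s1, s2, x=0.8):
    # Positive only when the edit distance is below (1 - X) times the
    # longer string's length, i.e. similarity is at least X.
    max_len = max(len(s1), len(s2))
    return Levenshtein.distance(s1.lower(), s2.lower()) < (1 - x) * max_len

>>> is_match('Hello, All you people', 'hello, all You peopl')
True
>>> is_match('ape', 'apple')  # distance 2 on length 5, i.e. X = 0.6
False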
Solution 5:
Is that what you mean?
>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> import keyword
>>> get_close_matches('wheel', keyword.kwlist)
['while']
>>> get_close_matches('apple', keyword.kwlist)
[]
>>> get_close_matches('accept', keyword.kwlist)
['except']
See http://docs.python.org/library/difflib.html#difflib.get_close_matches for details.
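get_close_matches scores candidates with SequenceMatcher's ratio; its n and cutoff parameters (defaults n=3 and cutoff=0.6) cap the number of results and set the minimum score:
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'], n=1, cutoff=0.8)
['apple']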