Checking fuzzy/approximate substring existing in a longer string, in Python?

The new regex library that's soon supposed to replace re includes fuzzy matching.

https://pypi.python.org/pypi/regex/

The fuzzy matching syntax looks fairly expressive, but this would give you a match with one or fewer insertions/additions/deletions.

import regex
regex.match('(amazing){e<=1}', 'amaging')

How about using difflib.SequenceMatcher.get_matching_blocks?

>>> import difflib
>>> large_string = "thelargemanhatanproject"
>>> query_string = "manhattan"
>>> s = difflib.SequenceMatcher(None, large_string, query_string)
>>> sum(n for i,j,n in s.get_matching_blocks()) / float(len(query_string))
0.8888888888888888

>>> query_string = "banana"
>>> s = difflib.SequenceMatcher(None, large_string, query_string)
>>> sum(n for i,j,n in s.get_matching_blocks()) / float(len(query_string))
0.6666666666666666

UPDATE

import difflib

def matches(large_string, query_string, threshold):
    words = large_string.split()
    for word in words:
        s = difflib.SequenceMatcher(None, word, query_string)
        match = ''.join(word[i:i+n] for i, j, n in s.get_matching_blocks() if n)
        if len(match) / float(len(query_string)) >= threshold:
            yield match

large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"
print list(matches(large_string, query_string, 0.8))

Above code print: ['manhatan', 'manhattn']

Checking fuzzy/approximate substring existing in a longer string, in Python?

Related

Recent Posts