Fuzzy String Comparison
I am trying to write a program that reads in a file and scores each sentence against an original sentence. A sentence that matches the original exactly should receive a score of 1, and a sentence that means the opposite should receive a 0. All other fuzzy matches should receive a score somewhere between 0 and 1.
I am unsure which approach to use to do this in Python 3.
I have included sample text below: Text 1 is the original, and the strings that follow it are the ones to compare.
Text: Sample
Text 1: It was a dark and stormy night. I was all alone sitting on a red chair. I was not completely alone as I had three cats.
Text 20: It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines // Should score high point but not 1
Text 21: It was a murky and tempestuous night. I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines // Should score lower than text 20
Text 22: I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines. It was a murky and tempestuous night. // Should score lower than text 21 but NOT 0
Text 24: It was a dark and stormy night. I was not alone. I was not sitting on a red chair. I had three cats. // Should score a 0!
There is a package called fuzzywuzzy. Install it via pip:
pip install fuzzywuzzy
Simple usage:
>>> from fuzzywuzzy import fuzz
>>> fuzz.ratio("this is a test", "this is a test!")
96
The package is built on top of difflib. Why not just use that, you ask? Apart from being a bit simpler, it has a number of different matching methods (such as token-order insensitivity and partial string matching) that make it more powerful in practice. The process.extract functions are especially useful: they find the best-matching strings and their ratios from a set of choices. From the readme:
Partial Ratio
>>> fuzz.partial_ratio("this is a test", "this is a test!")
100
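To make the idea behind partial matching concrete, here is a rough sketch that uses only difflib. It is not fuzzywuzzy's actual implementation; the name partial_similarity is hypothetical, and it returns a score between 0 and 1 rather than 0 to 100. It slides a window the length of the shorter string across the longer one and keeps the best ratio.

```python
from difflib import SequenceMatcher


def partial_similarity(a: str, b: str) -> float:
    """Best ratio between the shorter string and any same-length window of the longer one.

    Hypothetical helper for illustration; not part of fuzzywuzzy or difflib.
    """
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        # Two empty strings are treated as identical; otherwise nothing matches.
        return 1.0 if not longer else 0.0
    width = len(shorter)
    best = 0.0
    for start in range(len(longer) - width + 1):
        window = longer[start:start + width]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return best


print(partial_similarity("this is a test", "this is a test!"))  # 1.0
```

The brute-force window scan is quadratic in the worst case, so this is only meant to show the principle.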
Token Sort Ratio
>>> fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
90
>>> fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
100
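The principle behind token sort can be sketched with the standard library alone: lowercase both strings, sort their whitespace-separated tokens, and compare the rejoined strings. This is an approximation under those assumptions, not fuzzywuzzy's own code, and token_sort_similarity is a made-up name.

```python
from difflib import SequenceMatcher


def token_sort_similarity(a: str, b: str) -> float:
    """Compare two strings after sorting their tokens, so word order is ignored.

    Hypothetical helper for illustration; returns a value between 0 and 1.
    """
    def normalize(text: str) -> str:
        # Lowercase, split on whitespace, sort the tokens, and rejoin them.
        return " ".join(sorted(text.lower().split()))

    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


print(token_sort_similarity("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear"))  # 1.0
```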
Token Set Ratio
>>> fuzz.token_sort_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
84
>>> fuzz.token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
100
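Token set matching goes one step further and ignores repeated words. The sketch below is loosely modelled on that idea and should not be read as the library's exact algorithm: it compares the shared tokens against the shared tokens plus each side's leftovers and keeps the best ratio. The function name is hypothetical.

```python
from difflib import SequenceMatcher


def token_set_similarity(a: str, b: str) -> float:
    """Compare two strings by their sets of tokens, ignoring order and duplicates.

    Hypothetical helper for illustration; returns a value between 0 and 1.
    """
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())

    common = " ".join(sorted(tokens_a & tokens_b))
    only_a = " ".join(sorted(tokens_a - tokens_b))
    only_b = " ".join(sorted(tokens_b - tokens_a))

    # Shared tokens followed by whatever is unique to each side.
    combined_a = (common + " " + only_a).strip()
    combined_b = (common + " " + only_b).strip()

    pairs = [(common, combined_a), (common, combined_b), (combined_a, combined_b)]
    return max(SequenceMatcher(None, x, y).ratio() for x, y in pairs)


print(token_set_similarity("fuzzy was a bear", "fuzzy fuzzy was a bear"))  # 1.0
```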
Process
>>> from fuzzywuzzy import process
>>> choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
>>> process.extract("new york jets", choices, limit=2)
[('New York Jets', 100), ('New York Giants', 78)]
>>> process.extractOne("cowboys", choices)
('Dallas Cowboys', 90)
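If installing a third-party package is not an option, the same kind of ranking can be approximated with difflib. This is only a sketch: rank_choices is a hypothetical helper, it lowercases both sides before comparing, and its scores run from 0 to 1, so the numbers will not match fuzzywuzzy's.

```python
from difflib import SequenceMatcher


def rank_choices(query, choices, limit=2):
    """Return the `limit` best (choice, score) pairs, highest score first.

    Hypothetical helper for illustration; comparison is case-insensitive.
    """
    scored = [
        (choice, SequenceMatcher(None, query.lower(), choice.lower()).ratio())
        for choice in choices
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
for choice, score in rank_choices("new york jets", choices):
    print(f"{choice}: {score:.2f}")
```

The standard library also has difflib.get_close_matches, which returns the best matches without their scores.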
There is a module in the standard library, difflib, that can compare strings and return a score based on their similarity. The SequenceMatcher class should do what you want.
A small example from the Python prompt:
>>> from difflib import SequenceMatcher as SM
>>> s1 = ' It was a dark and stormy night. I was all alone sitting on a red chair. I was not completely alone as I had three cats.'
>>> s2 = ' It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines.'
>>> SM(None, s1, s2).ratio()
0.9112903225806451
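As a sketch of how this could be applied to the whole sample from the question, the script below scores every comparison text against Text 1. The similarity helper and the dictionary layout are assumptions made for illustration. Keep in mind that SequenceMatcher compares characters, not meaning, so Text 24 will not come out as 0: it shares most of its characters with the original even though it says the opposite.

```python
from difflib import SequenceMatcher

original = (
    "It was a dark and stormy night. I was all alone sitting on a red chair. "
    "I was not completely alone as I had three cats."
)

# Comparison texts taken from the question, keyed by their labels.
candidates = {
    "Text 20": "It was a murky and stormy night. I was all alone sitting on a crimson chair. "
               "I was not completely alone as I had three felines",
    "Text 21": "It was a murky and tempestuous night. I was all alone sitting on a crimson cathedra. "
               "I was not completely alone as I had three felines",
    "Text 22": "I was all alone sitting on a crimson cathedra. I was not completely alone as "
               "I had three felines. It was a murky and tempestuous night.",
    "Text 24": "It was a dark and stormy night. I was not alone. I was not sitting on a red chair. "
               "I had three cats.",
}


def similarity(a: str, b: str) -> float:
    """Character-level similarity between 0 and 1 (1 means identical)."""
    return SequenceMatcher(None, a, b).ratio()


scores = {label: similarity(original, text) for label, text in candidates.items()}
for label, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{label}: {score:.3f}")
```

Getting a true 0 for a sentence with the opposite meaning would need some form of semantic comparison, which is beyond what string matching can offer.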
fuzzyset is much faster than fuzzywuzzy (which is built on difflib) for both indexing and searching.
from fuzzyset import FuzzySet
corpus = """It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines
It was a murky and tempestuous night. I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines
I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines. It was a murky and tempestuous night.
It was a dark and stormy night. I was not alone. I was not sitting on a red chair. I had three cats."""
corpus = [line.lstrip() for line in corpus.split("\n")]
fs = FuzzySet(corpus)
query = "It was a dark and stormy night. I was all alone sitting on a red chair. I was not completely alone as I had three cats."
fs.get(query)
# [(0.873015873015873, 'It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines')]
Warning: Be careful not to mix unicode and bytes in your fuzzyset.
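One way to guard against that in Python 3 is to normalize everything to str before it goes into the set. The helper below is a minimal sketch; to_text is a hypothetical name and UTF-8 is an assumed encoding, so adjust it to match your data.

```python
def to_text(value, encoding="utf-8"):
    """Return `value` as str, decoding it first if it is bytes.

    Hypothetical helper for illustration; the UTF-8 default is an assumption.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# A corpus that accidentally mixes bytes and str entries.
raw_corpus = [b"It was a dark and stormy night.", "I had three cats."]
clean_corpus = [to_text(item) for item in raw_corpus]
print(clean_corpus)  # ['It was a dark and stormy night.', 'I had three cats.']
```

The cleaned list can then be passed to FuzzySet in place of the raw one.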