Matching strings within two lists
Especially given your comment about doing this on large lists (10,000 × 1,200), I would recommend RapidFuzz (I am the author). A solution using RapidFuzz could look like this:
from rapidfuzz import process, fuzz
import numpy as np
list1 = ["Real Madrid", "Benfica", "Lazio", "FC Milan"]
list2 = ["Madrid", "Barcelona", "Milan"]
scores = process.cdist(
    list1, list2, scorer=fuzz.ratio,
    dtype=np.uint8, score_cutoff=60)
# scores is array([[71,  0,  0],
#                  [ 0,  0,  0],
#                  [ 0,  0,  0],
#                  [ 0,  0, 77]], dtype=uint8)
matches = np.any(scores, 1)
# matches is array([ True, False, False, True])
This still processes the whole N*M matrix, but it is significantly faster than doing the same with fuzzywuzzy/thefuzz. When working with really large lists, it is possible to enable multithreading in process.cdist by passing the named argument workers (e.g. workers=-1 to use all available cores). The results above can be converted to the lists you showed in the example if that is needed:
matching = [x for x, is_match in zip(list1, matches) if is_match]
# ['Real Madrid', 'FC Milan']
not_matching = [x for x, is_match in zip(list1, matches) if not is_match]
# ['Benfica', 'Lazio']
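To make the matrix step above more concrete, here is a minimal pure-Python sketch of the same idea: build an N*M score matrix, zero out everything below the cutoff, and flag each row that has at least one nonzero entry. It uses difflib.SequenceMatcher from the standard library as a stand-in scorer. That is an assumption for illustration only, not how RapidFuzz computes fuzz.ratio, and the helper names ratio and score_matrix are hypothetical. Expect it to be far slower than process.cdist.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Stand-in similarity score on a 0-100 scale (assumption: difflib, not RapidFuzz)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def score_matrix(rows, cols, score_cutoff=60):
    """Return one row per entry of `rows` and one column per entry of `cols`.

    Scores below `score_cutoff` are stored as 0.
    """
    matrix = []
    for r in rows:
        row = []
        for c in cols:
            score = ratio(r, c)
            row.append(score if score >= score_cutoff else 0)
        matrix.append(row)
    return matrix


list1 = ["Real Madrid", "Benfica", "Lazio", "FC Milan"]
list2 = ["Madrid", "Barcelona", "Milan"]

scores = score_matrix(list1, list2)
# A row counts as a match if any of its scores survived the cutoff,
# which is the pure-Python counterpart of np.any(scores, 1).
matches = [any(row) for row in scores]

matching = [x for x, is_match in zip(list1, matches) if is_match]
not_matching = [x for x, is_match in zip(list1, matches) if not is_match]
print(matching)      # ['Real Madrid', 'FC Milan']
print(not_matching)  # ['Benfica', 'Lazio']
```

The sketch only shows what the matrix and the row-wise check compute. For real workloads the vectorized, optionally multithreaded process.cdist call is the better choice.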
I benchmarked the RapidFuzz solution on an i7-8550U using two large lists (10,000 × 1,200):
from timeit import timeit

print(timeit(
"""
scores = process.cdist(
list1, list2, scorer=fuzz.ratio,
dtype=np.uint8, score_cutoff=60)
matches = np.any(scores, 1)
matching = [x for x, is_match in zip(list1, matches) if is_match]
not_matching = [x for x, is_match in zip(list1, matches) if not is_match]
""",
setup="""
from rapidfuzz import process, fuzz
import numpy as np
list1 = ["Real Madrid", "Benfica", "Lazio", "FC Milan"] * 2500
list2 = ["Madrid", "Barcelona", "Milan"] * 400
""", number=1
))
which took 0.33 seconds. Using workers=-1 reduced the runtime to 0.08 seconds.
Each item in list1 is compared with several items in list2, and these repeated combinations created duplicate entries in the no_matching list. Before adding an item to no_matching, check whether it is already in the matching list; if it is, don't add it. The code below gives the expected output.
from fuzzywuzzy import fuzz

def Matching(list1, list2):
    no_matching = []
    matching = []
    m_score = 0
    for item1 in list1:
        for item2 in list2:
            # fuzz.ratio returns a similarity score from 0 to 100
            m_score = fuzz.ratio(item1, item2)
            if m_score > 60:
                matching.append(item1)
                break  # one match is enough; avoids duplicates in matching
        # Decide only after item1 has been compared with all of list2
        if item1 not in matching:
            no_matching.append(item1)
    return matching, no_matching
list1 = ["Real Madrid", "Benfica", "Lazio", "FC Milan"]
list2 = ["Madrid", "Barcelona", "Milan"]
print(Matching(list1, list2))
Output:
(['Real Madrid', 'FC Milan'], ['Benfica', 'Lazio'])
Edit: instead of running two nested for loops, you can iterate over all the combinations:
import itertools
new_list = list(itertools.product(list1, list2))
Output:
[('Real Madrid', 'Madrid'), ('Real Madrid', 'Barcelona'), ('Real Madrid', 'Milan'), ('Benfica', 'Madrid'), ('Benfica', 'Barcelona'), ('Benfica', 'Milan'), ('Lazio', 'Madrid'), ('Lazio', 'Barcelona'), ('Lazio', 'Milan'), ('FC Milan', 'Madrid'), ('FC Milan', 'Barcelona'), ('FC Milan', 'Milan')]
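As a rough sketch of how those pairs could then be used to split list1, the snippet below collects every left-hand item that scores above the threshold against at least one right-hand item, then derives both lists from that set. The scorer is a stand-in built on difflib.SequenceMatcher from the standard library (an assumption, so the example runs without fuzzywuzzy), and the names ratio and split_by_match are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import product


def ratio(a, b):
    """Stand-in similarity score on a 0-100 scale (assumption: difflib, not fuzzywuzzy)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def split_by_match(list1, list2, threshold=60):
    """Split list1 into items with and without a match in list2."""
    # Every item of list1 that scores above the threshold against
    # at least one item of list2; a set keeps each item only once.
    matched = {a for a, b in product(list1, list2) if ratio(a, b) > threshold}
    # Rebuild both lists from list1 so the original order is preserved.
    matching = [x for x in list1 if x in matched]
    no_matching = [x for x in list1 if x not in matched]
    return matching, no_matching


list1 = ["Real Madrid", "Benfica", "Lazio", "FC Milan"]
list2 = ["Madrid", "Barcelona", "Milan"]
print(split_by_match(list1, list2))
# (['Real Madrid', 'FC Milan'], ['Benfica', 'Lazio'])
```

Note that product still visits all N*M pairs, so this changes the shape of the code and not the amount of work.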
You have a problem with indentation:
from fuzzywuzzy import fuzz

def Matching(list1, list2):
    no_matching = []
    matching = []
    m_score = 0
    for item1 in list1:
        for item2 in list2:
            m_score = fuzz.ratio(item1, item2)
            if m_score > 60:
                matching.append(item1)
        # Dedented: this check runs once per item1, after the inner loop
        # has finished, instead of once per (item1, item2) pair
        if item1 not in matching:
            no_matching.append(item1)
    return matching, no_matching