Python - difference between two strings
I'd like to store a lot of words in a list. Many of these words are very similar. For example I have word afrykanerskojęzyczny
and many of words like afrykanerskojęzycznym
, afrykanerskojęzyczni
, nieafrykanerskojęzyczni
. What is the effective (fast and giving small diff size) solution to find difference between two strings and restore second string from the first one and diff?
Solution 1:
You can use ndiff in the difflib module to do this. It has all the information necessary to convert one string into another string.
A simple example:
import difflib
cases=[('afrykanerskojęzyczny', 'afrykanerskojęzycznym'),
('afrykanerskojęzyczni', 'nieafrykanerskojęzyczni'),
('afrykanerskojęzycznym', 'afrykanerskojęzyczny'),
('nieafrykanerskojęzyczni', 'afrykanerskojęzyczni'),
('nieafrynerskojęzyczni', 'afrykanerskojzyczni'),
('abcdefg','xac')]
for a,b in cases:
print('{} => {}'.format(a,b))
for i,s in enumerate(difflib.ndiff(a, b)):
if s[0]==' ': continue
elif s[0]=='-':
print(u'Delete "{}" from position {}'.format(s[-1],i))
elif s[0]=='+':
print(u'Add "{}" to position {}'.format(s[-1],i))
print()
prints:
afrykanerskojęzyczny => afrykanerskojęzycznym
Add "m" to position 20
afrykanerskojęzyczni => nieafrykanerskojęzyczni
Add "n" to position 0
Add "i" to position 1
Add "e" to position 2
afrykanerskojęzycznym => afrykanerskojęzyczny
Delete "m" from position 20
nieafrykanerskojęzyczni => afrykanerskojęzyczni
Delete "n" from position 0
Delete "i" from position 1
Delete "e" from position 2
nieafrynerskojęzyczni => afrykanerskojzyczni
Delete "n" from position 0
Delete "i" from position 1
Delete "e" from position 2
Add "k" to position 7
Add "a" to position 8
Delete "ę" from position 16
abcdefg => xac
Add "x" to position 0
Delete "b" from position 2
Delete "d" from position 4
Delete "e" from position 5
Delete "f" from position 6
Delete "g" from position 7
Solution 2:
I like the ndiff answer, but if you want to spit it all into a list of only the changes, you could do something like:
import difflib
case_a = 'afrykbnerskojęzyczny'
case_b = 'afrykanerskojęzycznym'
output_list = [li for li in difflib.ndiff(case_a, case_b) if li[0] != ' ']
Solution 3:
You can look into the regex module (the fuzzy section). I don't know if you can get the actual differences, but at least you can specify allowed number of different types of changes like insert, delete, and substitutions:
import regex
sequence = 'afrykanerskojezyczny'
queries = [ 'afrykanerskojezycznym', 'afrykanerskojezyczni',
'nieafrykanerskojezyczni' ]
for q in queries:
m = regex.search(r'(%s){e<=2}'%q, sequence)
print 'match' if m else 'nomatch'