Is there any program for fuzzy string matching that provides a match score?
I have a list of strings in file A and another in file B. I want to take each string in file A and find the most similar string in file B. For this, I am looking for a tool that does fuzzy comparison.
For example:
$ fuzzy_compare "Some string" "Some string"
100
Where 100 is some equality ratio, computed for example from the Levenshtein distance.
Is there any utility? I don't want to reinvent the wheel.
Solution 1:
I found this page, which provides implementations of the Levenshtein distance algorithm in different languages. So, for example, in bash you could do:
#!/bin/bash

# Compute the Levenshtein distance between two strings, using a flat
# bash array as a (str1len+1) x (str2len+1) dynamic-programming matrix.
function levenshtein {
    if [ "$#" -ne "2" ]; then
        echo "Usage: $0 word1 word2" >&2
    elif [ "${#1}" -lt "${#2}" ]; then
        levenshtein "$2" "$1"
    else
        local str1len=${#1}
        local str2len=${#2}
        local width=$((str1len + 1))   # row stride of the matrix
        local d i j
        for i in $(seq 0 $((width * (str2len + 1)))); do
            d[i]=0
        done
        # The distance from the empty prefix is the prefix length
        for i in $(seq 0 "$str1len"); do
            d[i]=$i
        done
        for j in $(seq 0 "$str2len"); do
            d[j*width]=$j
        done
        for j in $(seq 1 "$str2len"); do
            for i in $(seq 1 "$str1len"); do
                local cost=1
                [ "${1:i-1:1}" = "${2:j-1:1}" ] && cost=0
                local del=$((d[(i-1)+width*j] + 1))
                local ins=$((d[i+width*(j-1)] + 1))
                local alt=$((d[(i-1)+width*(j-1)] + cost))
                d[i+width*j]=$(printf '%s\n' "$del" "$ins" "$alt" | sort -n | head -n1)
            done
        done
        echo "${d[str1len+width*str2len]}"
    fi
}

while IFS= read -r str1; do
    while IFS= read -r str2; do
        lev=$(levenshtein "$str1" "$str2")
        printf '%s / %s : %s\n' "$str1" "$str2" "$lev"
    done < "$2"
done < "$1"
Save that as ~/bin/levenshtein.sh
, make it executable (chmod a+x ~/bin/levenshtein.sh
) and run it on your two files. For example:
$ cat fileA
foo
zoo
bar
fob
baar
$ cat fileB
foo
loo
baar
bob
gaf
$ levenshtein.sh fileA fileB
foo / foo : 0
foo / loo : 1
foo / baar : 4
foo / bob : 2
foo / gaf : 3
zoo / foo : 1
zoo / loo : 1
zoo / baar : 4
zoo / bob : 2
zoo / gaf : 3
bar / foo : 3
bar / loo : 3
bar / baar : 1
bar / bob : 2
bar / gaf : 2
fob / foo : 1
fob / loo : 2
fob / baar : 4
fob / bob : 1
fob / gaf : 3
baar / foo : 4
baar / loo : 4
baar / baar : 0
baar / bob : 3
baar / gaf : 3
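To answer the actual question (the closest fileB match for each fileA string, rather than every pair), you can post-process that output. A sketch, assuming the `str1 / str2 : dist` format printed above; here it is fed a few sample lines on stdin, but in practice you would pipe `levenshtein.sh fileA fileB` into the awk script:

```shell
# Keep only the lowest-distance pairing for each left-hand string.
awk -F' : ' '
{
    split($1, pair, " / ")
    d = $2 + 0
    if (!(pair[1] in best) || d < best[pair[1]]) {
        best[pair[1]] = d
        closest[pair[1]] = pair[2]
    }
}
END {
    for (s in best)
        printf "%s -> %s (distance %d)\n", s, closest[s], best[s]
}' <<'EOF' | sort
foo / foo : 0
foo / loo : 1
bar / baar : 1
bar / bob : 2
EOF
```

The `sort` at the end is only there because awk's `for (s in best)` iterates in an unspecified order. On the full sample files this reduces the 25 pairs to one best match per fileA string.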
That's fine for a few patterns but will get very slow for larger files. If that's an issue, try one of the implementations in other languages. For example, in Perl:
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(min);

# Classic dynamic-programming Levenshtein distance
sub levenshtein {
    my ($str1, $str2) = @_;
    my @ar1 = split //, $str1;
    my @ar2 = split //, $str2;
    my @dist;
    $dist[$_][0] = $_ foreach (0 .. @ar1);
    $dist[0][$_] = $_ foreach (0 .. @ar2);
    foreach my $i (1 .. @ar1) {
        foreach my $j (1 .. @ar2) {
            my $cost = $ar1[$i - 1] eq $ar2[$j - 1] ? 0 : 1;
            $dist[$i][$j] = min(
                $dist[$i - 1][$j] + 1,
                $dist[$i][$j - 1] + 1,
                $dist[$i - 1][$j - 1] + $cost
            );
        }
    }
    return $dist[@ar1][@ar2];
}

open(my $fh1, '<', $ARGV[0]) or die "Can't open $ARGV[0]: $!";
open(my $fh2, '<', $ARGV[1]) or die "Can't open $ARGV[1]: $!";
chomp(my @strings1 = <$fh1>);
chomp(my @strings2 = <$fh2>);

foreach my $str1 (@strings1) {
    foreach my $str2 (@strings2) {
        my $lev = levenshtein($str1, $str2);
        print "$str1 / $str2 : $lev\n";
    }
}
As above, save the script as ~/bin/levenshtein.pl, make it executable, and run it with the two files as arguments:
~/bin/levenshtein.pl fileA fileB
Even with the very small files used here, the Perl approach is almost 90 times faster than the bash one:
$ time levenshtein.sh fileA fileB > /dev/null
real 0m0.965s
user 0m0.070s
sys 0m0.057s
$ time levenshtein.pl fileA fileB > /dev/null
real 0m0.011s
user 0m0.010s
sys 0m0.000s
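Incidentally, if you want the 0-100 equality ratio the question asks for rather than a raw distance, you can normalize by the longer string's length. A minimal sketch; the `similarity` helper and the formula 100 * (1 - dist / max(len1, len2)) are my own choice, not something either script above computes. It takes the two strings plus the distance already computed by `levenshtein`:

```shell
# Hypothetical helper: turn an edit distance into a 0-100 score,
# using 100 * (1 - dist / max(len1, len2)).
similarity() {
    local s1=$1 s2=$2 dist=$3
    local max=${#s1}
    [ "${#s2}" -gt "$max" ] && max=${#s2}
    [ "$max" -eq 0 ] && { echo 100; return; }   # two empty strings are identical
    # bash integer arithmetic, so the result is truncated
    echo $(( 100 * (max - dist) / max ))
}

similarity "Some string" "Some string" 0   # prints 100
similarity "foo" "loo" 1                   # prints 66
```

For example, `similarity foo baar 4` gives 0, since every edit position differs.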
Solution 2:
I am aware that this is a very old thread, but I found it while looking for the same thing, and there is a tool that does exactly this:
The fstrcmp
command will do exactly what you are asking:
> fstrcmp "Some string" "Some string"
1.0000
> fstrcmp "Some string" "Other string"
0.6957
> fstrcmp "Some string" "Pangalactic garglebalster"
0.2222
According to the manpage, the resulting numbers are based on the edit distance (which I take to be the same as the Levenshtein distance), normalized so that 1.0000 means the strings are identical.
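Assuming fstrcmp is installed and prints a single 0-to-1 similarity as shown above, a small wrapper can apply it to the original two-file problem. The `best_match` function and the awk-based decimal comparison are my own sketch, not part of fstrcmp:

```shell
# Report the most similar fileB line for each fileA line, keeping the
# highest fstrcmp score seen so far.
best_match() {
    local fileA=$1 fileB=$2 a b score best bestscore
    while IFS= read -r a; do
        best=""
        bestscore=0
        while IFS= read -r b; do
            score=$(fstrcmp "$a" "$b")
            # bash arithmetic is integer-only, so compare the decimal
            # scores with awk
            if awk -v s="$score" -v m="$bestscore" 'BEGIN { exit !(s > m) }'; then
                bestscore=$score
                best=$b
            fi
        done < "$fileB"
        printf '%s -> %s (%s)\n' "$a" "$best" "$bestscore"
    done < "$fileA"
}
```

Then `best_match fileA fileB` prints one line per fileA string with its closest fileB match and the score.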