How to calculate distance similarity measure of given 2 strings?
I just addressed this exact same issue a few weeks ago. Since someone is asking now, I'll share the code. In my exhaustive tests my code is about 10x faster than the C# example on Wikipedia even when no maximum distance is supplied. When a maximum distance is supplied, this performance gain increases to 30x - 100x +. Note a couple key points for performance:
- If you need to compare the same words over and over, first convert the words to arrays of integers. The Damerau-Levenshtein algorithm includes many >, <, == comparisons, and
ints
compare much faster thanchars
. - It includes a short-circuiting mechanism to quit if the distance exceeds a provided maximum
- Use a rotating set of three arrays rather than a massive matrix as in all the implementations I've see elsewhere
- Make sure your arrays slice accross the shorter word width.
Code (it works the exact same if you replace int[]
with String
in the parameter declarations:
/// <summary>
/// Computes the Damerau-Levenshtein Distance between two strings, represented as arrays of
/// integers, where each integer represents the code point of a character in the source string.
/// Includes an optional threshhold which can be used to indicate the maximum allowable distance.
/// </summary>
/// <param name="source">An array of the code points of the first string</param>
/// <param name="target">An array of the code points of the second string</param>
/// <param name="threshold">Maximum allowable distance</param>
/// <returns>Int.MaxValue if threshhold exceeded; otherwise the Damerau-Leveshteim distance between the strings</returns>
public static int DamerauLevenshteinDistance(int[] source, int[] target, int threshold) {
int length1 = source.Length;
int length2 = target.Length;
// Return trivial case - difference in string lengths exceeds threshhold
if (Math.Abs(length1 - length2) > threshold) { return int.MaxValue; }
// Ensure arrays [i] / length1 use shorter length
if (length1 > length2) {
Swap(ref target, ref source);
Swap(ref length1, ref length2);
}
int maxi = length1;
int maxj = length2;
int[] dCurrent = new int[maxi + 1];
int[] dMinus1 = new int[maxi + 1];
int[] dMinus2 = new int[maxi + 1];
int[] dSwap;
for (int i = 0; i <= maxi; i++) { dCurrent[i] = i; }
int jm1 = 0, im1 = 0, im2 = -1;
for (int j = 1; j <= maxj; j++) {
// Rotate
dSwap = dMinus2;
dMinus2 = dMinus1;
dMinus1 = dCurrent;
dCurrent = dSwap;
// Initialize
int minDistance = int.MaxValue;
dCurrent[0] = j;
im1 = 0;
im2 = -1;
for (int i = 1; i <= maxi; i++) {
int cost = source[im1] == target[jm1] ? 0 : 1;
int del = dCurrent[im1] + 1;
int ins = dMinus1[i] + 1;
int sub = dMinus1[im1] + cost;
//Fastest execution for min value of 3 integers
int min = (del > ins) ? (ins > sub ? sub : ins) : (del > sub ? sub : del);
if (i > 1 && j > 1 && source[im2] == target[jm1] && source[im1] == target[j - 2])
min = Math.Min(min, dMinus2[im2] + cost);
dCurrent[i] = min;
if (min < minDistance) { minDistance = min; }
im1++;
im2++;
}
jm1++;
if (minDistance > threshold) { return int.MaxValue; }
}
int result = dCurrent[maxi];
return (result > threshold) ? int.MaxValue : result;
}
Where Swap
is:
static void Swap<T>(ref T arg1,ref T arg2) {
T temp = arg1;
arg1 = arg2;
arg2 = temp;
}
What you are looking for is called edit distance or Levenshtein distance. The wikipedia article explains how it is calculated, and has a nice piece of pseudocode at the bottom to help you code this algorithm in C# very easily.
Here's an implementation from the first site linked below:
private static int CalcLevenshteinDistance(string a, string b)
{
if (String.IsNullOrEmpty(a) && String.IsNullOrEmpty(b)) {
return 0;
}
if (String.IsNullOrEmpty(a)) {
return b.Length;
}
if (String.IsNullOrEmpty(b)) {
return a.Length;
}
int lengthA = a.Length;
int lengthB = b.Length;
var distances = new int[lengthA + 1, lengthB + 1];
for (int i = 0; i <= lengthA; distances[i, 0] = i++);
for (int j = 0; j <= lengthB; distances[0, j] = j++);
for (int i = 1; i <= lengthA; i++)
for (int j = 1; j <= lengthB; j++)
{
int cost = b[j - 1] == a[i - 1] ? 0 : 1;
distances[i, j] = Math.Min
(
Math.Min(distances[i - 1, j] + 1, distances[i, j - 1] + 1),
distances[i - 1, j - 1] + cost
);
}
return distances[lengthA, lengthB];
}
There is a big number of string similarity distance algorithms that can be used. Some listed here (but not exhaustively listed are):
- Levenstein
- Needleman Wunch
- Smith Waterman
- Smith Waterman Gotoh
- Jaro, Jaro Winkler
- Jaccard Similarity
- Euclidean Distance
- Dice Similarity
- Cosine Similarity
- Monge Elkan
A library that contains implementation to all of these is called SimMetrics which has both java and c# implementations.