Similar String algorithm

Solution 1:

Levenshtein distance alone will not completely work, because you want to allow rearrangements. I think your best bet is going to be to find the best rearrangement, with Levenshtein distance as the cost for each word.

Finding the cost of rearrangement is a bit like the pancake sorting problem. So you can permute every combination of words (filtering out exact matches) against every combination of the other string, trying to minimize a combination of the permutation distance and the Levenshtein distance on each word pair.

Edit: Now that I have a second, I can post a quick example (all 'best' guesses are by inspection, not from actually running the algorithms):

original strings             | best rearrangement w/ lev distance per word
Into the clear blue sky      |    Into the c_lear blue sky 
The color is sky blue        |    is__ the colo_r blue sky

R_dist = dist( 3 1 2 5 4 ) --> 3 1 2 *4 5* --> *2 1 3* 4 5 --> *1 2* 3 4 5 = 3  
L_dist = (2D+S) + (I+D+S) (Total substitutions: 2, deletions: 3, insertions: 1)  

(Notice each flip reverses every element in its range, and I only use ranges where Xi - Xj = +/- 1.)

Another example:

original strings             | best rearrangement w/ lev distance per word
Into the clear blue sky      |   Into the clear blue sky 
In the blue clear sky        |   In__ the clear blue sky

R_dist = dist( 1 2 4 3 5 ) -->  1 2 *3 4* 5  = 1
L_dist = (2D) (Total substitutions: 0, deletions: 2, insertions: 0)

And to show all possible combinations of the three...

The color is sky blue         |    The colo_r is sky blue
In the blue clear sky         |    the c_lear in sky blue

R_dist = dist( 2 4 1 3 5 ) --> *2 3 1 4* 5 --> *1 3 2* 4 5 --> 1 *2 3* 4 5 = 3
L_dist = (D+I+S) + (S) (Total substitutions: 2, deletions: 1, insertions: 1)

However you weight the cost function, the second pairing will come out with the lowest cost, which is what you expected!
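
To make the idea concrete, here is a brute-force sketch in C++. It is only an illustration under my own assumptions, not the exact scoring above: it pads the shorter word list with empty words, tries every permutation of one string's words, sums the per-word Levenshtein distances, and charges a flat movePenalty per displaced word as a crude stand-in for the flip-count R_dist. The factorial blow-up is fine for a handful of words.

#include <algorithm>
#include <climits>
#include <iostream>
#include <numeric>
#include <sstream>
#include <string>
#include <vector>
using namespace std;

// Standard dynamic-programming Levenshtein distance between two words.
int lev(const string& a, const string& b) {
    vector<vector<int>> d(a.size() + 1, vector<int>(b.size() + 1));
    for (size_t i = 0; i <= a.size(); ++i) d[i][0] = (int)i;
    for (size_t j = 0; j <= b.size(); ++j) d[0][j] = (int)j;
    for (size_t i = 1; i <= a.size(); ++i)
        for (size_t j = 1; j <= b.size(); ++j)
            d[i][j] = min({ d[i-1][j] + 1,                        // deletion
                            d[i][j-1] + 1,                        // insertion
                            d[i-1][j-1] + (a[i-1] != b[j-1]) });  // substitution
    return d[a.size()][b.size()];
}

vector<string> splitWords(const string& s) {
    istringstream in(s);
    vector<string> w;
    for (string t; in >> t; ) w.push_back(t);
    return w;
}

// Try every assignment of b's words to a's positions; score each candidate as
// (sum of per-word Levenshtein distances) + movePenalty per displaced word.
// The displacement count is a crude stand-in for the flip-based R_dist above.
int similarity(const string& sa, const string& sb, int movePenalty = 1) {
    vector<string> wa = splitWords(sa), wb = splitWords(sb);
    while (wa.size() < wb.size()) wa.push_back("");  // pad so extra words cost
    while (wb.size() < wa.size()) wb.push_back("");  // lev(word, "") = insert/delete

    vector<int> idx(wb.size());
    iota(idx.begin(), idx.end(), 0);

    int best = INT_MAX;
    do {
        int cost = 0;
        for (size_t i = 0; i < wa.size() && cost < best; ++i) {
            cost += lev(wa[i], wb[idx[i]]);
            if ((int)i != idx[i]) cost += movePenalty;  // word was moved
        }
        best = min(best, cost);
    } while (next_permutation(idx.begin(), idx.end()));  // factorial, fine for a few words
    return best;  // lower = more similar
}

int main() {
    // The closer pair should come out with the lower cost.
    cout << similarity("Into the clear blue sky", "In the blue clear sky") << "\n";
    cout << similarity("Into the clear blue sky", "The color is sky blue") << "\n";
}

Lower scores mean more similar, so the closer pairing wins, matching the hand-worked examples.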

Solution 2:

One way to determine a measure of "overall similarity without respect to order" is to use some kind of compression-based distance. Basically, the way most compression algorithms (e.g. gzip) work is to scan along a string looking for segments that have appeared earlier; any time such a segment is found, it is replaced with an (offset, length) pair identifying the earlier occurrence. You can use measures of how well two strings compress together to detect similarities between them.

Suppose you have a function string comp(string s) that returns a compressed version of s. You can then use the following expression as a "similarity score" between two strings s and t:

len(comp(s)) + len(comp(t)) - len(comp(s . t))

where . is taken to be concatenation. The idea is that you are measuring how much further you can compress t by looking at s first. If s == t, then len(comp(s . t)) will be barely any larger than len(comp(s)) and you'll get a high score, while if they are completely different, len(comp(s . t)) will be very near len(comp(s)) + len(comp(t)) and you'll get a score near zero. Intermediate levels of similarity produce intermediate scores.

Actually the following formula is even better as it is symmetric (i.e. the score doesn't change depending on which string is s and which is t):

2 * (len(comp(s)) + len(comp(t))) - len(comp(s . t)) - len(comp(t . s))

This technique has its roots in information theory.
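
As a sketch of how little code this needs, here is the symmetric score computed with zlib's compress()/compressBound() (the choice of zlib is an assumption on my part; any compressor whose output length you can read works). Error checking is omitted; link with -lz.

#include <zlib.h>
#include <iostream>
#include <string>
#include <vector>
using namespace std;

// Length of the zlib-compressed form of s, i.e. len(comp(s)).
size_t compLen(const string& s) {
    uLongf destLen = compressBound(s.size());
    vector<Bytef> buf(destLen);
    compress(buf.data(), &destLen,
             reinterpret_cast<const Bytef*>(s.data()), s.size());
    return destLen;  // compress() updates destLen to the actual compressed size
}

// The symmetric similarity score from the formula above; higher = more alike.
long simScore(const string& s, const string& t) {
    return 2L * (long)(compLen(s) + compLen(t))
           - (long)compLen(s + t) - (long)compLen(t + s);
}

int main() {
    cout << simScore("Into the clear blue sky", "In the blue clear sky") << "\n";
    cout << simScore("Into the clear blue sky", "zqxw vkgh pmtr") << "\n";  // should score lower
}

One caveat: a compressor's fixed header overhead can swamp the signal on very short strings, so this works best on inputs longer than a few words.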

Advantages: good compression algorithms are already available, so you don't need to do much coding, and they run in linear time (or nearly so), so they're fast. By contrast, solutions involving all permutations of words grow factorially in the number of words (although admittedly that may not be a problem in your case, as you say you know there will only be a handful of words).

Solution 3:

One way (although this is perhaps better suited to a spellcheck-type algorithm) is the "edit distance", i.e., calculating how many edits it takes to transform one string into another. A common technique is described here:

http://en.wikipedia.org/wiki/Levenshtein_distance
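
For reference, a minimal sketch of the standard dynamic-programming recurrence (the textbook algorithm, not code taken from the linked page):

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>
using namespace std;

// d[i][j] = edits needed to turn the first i chars of a into the first j chars of b.
int levenshtein(const string& a, const string& b) {
    vector<vector<int>> d(a.size() + 1, vector<int>(b.size() + 1));
    for (size_t i = 0; i <= a.size(); ++i) d[i][0] = (int)i;  // i deletions
    for (size_t j = 0; j <= b.size(); ++j) d[0][j] = (int)j;  // j insertions
    for (size_t i = 1; i <= a.size(); ++i)
        for (size_t j = 1; j <= b.size(); ++j)
            d[i][j] = min({ d[i-1][j] + 1,                        // delete a[i-1]
                            d[i][j-1] + 1,                        // insert b[j-1]
                            d[i-1][j-1] + (a[i-1] != b[j-1]) });  // substitute if needed
    return d[a.size()][b.size()];
}

int main() {
    cout << levenshtein("kitten", "sitting") << "\n";  // 3: k->s, e->i, +g
}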

Solution 4:

You might want to look into the algorithms used by biologists to compare DNA sequences, since they have to cope with many of the same things (chunks may be missing, or have been inserted, or just moved to a different position in the string).

The Smith-Waterman algorithm would be one example that'd probably work fairly well, although it might be too slow for your purposes. It might give you a starting point, though.
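
To get a feel for it, here is a scoring-only sketch of Smith-Waterman local alignment (no traceback; the match/mismatch/gap values are illustrative assumptions, not tuned):

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>
using namespace std;

// Smith-Waterman: H[i][j] is the best score of a local alignment ending at
// a[i-1], b[j-1]; scores are clamped at 0 so poorly matching regions can be abandoned.
int smithWaterman(const string& a, const string& b,
                  int match = 2, int mismatch = -1, int gap = -1) {
    vector<vector<int>> H(a.size() + 1, vector<int>(b.size() + 1, 0));
    int best = 0;
    for (size_t i = 1; i <= a.size(); ++i)
        for (size_t j = 1; j <= b.size(); ++j) {
            int diag = H[i-1][j-1] + (a[i-1] == b[j-1] ? match : mismatch);
            H[i][j] = max({ 0, diag, H[i-1][j] + gap, H[i][j-1] + gap });
            best = max(best, H[i][j]);
        }
    return best;  // score of the best-matching region anywhere in the two strings
}

int main() {
    cout << smithWaterman("Into the clear blue sky", "In the blue clear sky") << "\n";
}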

Solution 5:

I had a similar problem: I needed to get the percentage of characters in a string that were similar. It needed exact sequences, so, for example, comparing "hello sir" and "sir hello" needed to give me five characters that are the same; in this case they would be the two "hello"s. It would then take the length of the longer of the two strings and give me a percentage of how similar they were. This is the code that I came up with:

#include <algorithm>
#include <iostream>
#include <string>
using namespace std;

int bigger(const string& a, const string& b);

// Dispatch so the first argument to bigger() is always the longer string;
// the percentage is taken against the longer of the two.
int compare(const string& a, const string& b) {
    return a.size() > b.size() ? bigger(a, b) : bigger(b, a);
}

// Slide b along a (offset i) and track the longest run of consecutive
// matching characters found at any offset.
int bigger(const string& a, const string& b) {
    size_t maxcount = 0, currentcount = 0;  // longest run so far / current run
    for (size_t i = 0; i < a.size(); ++i) {
        for (size_t j = 0; j < b.size() && i + j < a.size(); ++j) {  // stay in bounds of a
            if (a[i + j] == b[j]) {
                ++currentcount;
            } else {
                maxcount = max(maxcount, currentcount);
                currentcount = 0;
            }
        }
        maxcount = max(maxcount, currentcount);  // a run may reach the end of the scan
        currentcount = 0;                        // don't carry a run into the next offset
    }
    return (int)((float)maxcount / (float)a.size() * 100);
}
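
For example, with the fixes above, comparing "hello sir" and "sir hello" finds the five-character run "hello", and 5 of the 9 characters gives 55%:

int main() {
    cout << compare("hello sir", "sir hello") << "\n";  // prints 55
}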