String similarity score/hash

Solution 1:

I believe what you're looking for is called a Locality Sensitive Hash. Whereas most hash algorithms are designed such that small variations in input cause large changes in output, these hashes attempt the opposite: small changes in input generate proportionally small changes in output.

As others have mentioned, there are inherent issues with forcing a multi-dimensional mapping into a 2-dimensional mapping. It's analogous to creating a flat map of the Earth... you can never accurately represent a sphere on a flat surface. The best you can do is find an LSH that is optimized for whatever feature you're using to determine whether strings are "alike".
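A Locality Sensitive Hash is easiest to see in code. Below is a minimal SimHash-style sketch over character trigrams; the trigram features, MD5, and the 32-bit width are arbitrary illustrative choices here, not a recommendation.

```python
import hashlib

def simhash(s, n=3, bits=32):
    """SimHash sketch: similar strings tend to get fingerprints
    that differ in only a few bit positions."""
    counts = [0] * bits
    # character n-grams as features (the whole string if shorter than n)
    grams = [s[i:i + n] for i in range(max(1, len(s) - n + 1))]
    for g in grams:
        h = int(hashlib.md5(g.encode()).hexdigest(), 16)
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1
    # final fingerprint: the sign of each bit-position tally
    return sum(1 << b for b in range(bits) if counts[b] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Similarity is then judged by the Hamming distance between fingerprints rather than by comparing the strings themselves.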

Solution 2:

Levenshtein distance or one of its derivatives is the algorithm you want. Match the given string against each string in the dictionary. (If you only need a fixed number of the most similar strings, you may want to use a min-heap.) If running Levenshtein distance against every string in the dictionary is too expensive, first use some rough algorithm to exclude words that are too distant from the list of candidates, then run Levenshtein distance on the remaining candidates.
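A sketch of that brute-force pass in Python (the dictionary contents below are made up for illustration):

```python
import heapq

def levenshtein(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def k_most_similar(query, dictionary, k=3):
    """Keep only the k closest words, heap-based."""
    return heapq.nsmallest(k, dictionary, key=lambda w: levenshtein(query, w))
```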


One way to exclude too-distant words is to index n-grams. Preprocess the dictionary by splitting each word into a list of n-grams. For example, with n = 3:

(0) "Hello world" -> ["Hel", "ell", "llo", "lo ", "o w", " wo", "wor", "orl", "rld"]
(1) "FooBarbar" -> ["Foo", "ooB", "oBa", "Bar", "arb", "rba", "bar"]
(2) "Foo world!" -> ["Foo", "oo ", "o w", " wo", "wor", "orl", "rld", "ld!"]
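The splitting step above can be sketched as:

```python
def ngrams(s, n=3):
    """All contiguous substrings of length n, in order."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]
```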

Next, create an index of n-grams:

" wo" -> [0, 2]
"Bar" -> [1]
"Foo" -> [1, 2]
"Hel" -> [0]
"arb" -> [1]
"bar" -> [1]
"ell" -> [0]
"ld!" -> [2]
"llo" -> [0]
"lo " -> [0]
"o w" -> [0, 2]
"oBa" -> [1]
"oo " -> [2]
"ooB" -> [1]
"orl" -> [0, 2]
"rba" -> [1]
"rld" -> [0, 2]
"wor" -> [0, 2]

When you need to find the most similar strings to a given string, split it into n-grams and select only those dictionary words that share at least one n-gram with it. This reduces the number of candidates to a reasonable amount, and you can then proceed with Levenshtein-matching the given string against each remaining candidate.
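Index construction and candidate filtering together, as a sketch using the three example words from above:

```python
from collections import defaultdict

def ngrams(s, n=3):
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def build_index(words, n=3):
    """Map each n-gram to the ids of the words containing it."""
    index = defaultdict(set)
    for word_id, word in enumerate(words):
        for gram in ngrams(word, n):
            index[gram].add(word_id)
    return index

def candidates(query, words, index, n=3):
    """Every word sharing at least one n-gram with the query."""
    ids = set()
    for gram in ngrams(query, n):
        ids |= index.get(gram, set())
    return [words[i] for i in sorted(ids)]

words = ["Hello world", "FooBarbar", "Foo world!"]
idx = build_index(words)
```

Only the surviving candidates go through the full Levenshtein comparison; everything else is never touched.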


If your strings are long enough, you can reduce the index size by using the min-hashing technique: compute an ordinary hash for each n-gram and keep only the K smallest hashes, throwing the others away.
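A sketch of such a signature. `zlib.crc32` stands in for the "ordinary hash", and K = 4 is an arbitrary choice; both are illustrative, not prescriptive.

```python
import zlib

def minhash_signature(s, n=3, k=4):
    """The K smallest n-gram hashes of a string (assumes len(s) >= n)."""
    grams = {s[i:i + n] for i in range(len(s) - n + 1)}
    return sorted(zlib.crc32(g.encode()) for g in grams)[:k]
```

Two strings that share most of their n-grams will, with high probability, also share most entries of their signatures, so comparing signatures approximates comparing the full n-gram sets at a fraction of the index size.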

P.S. This presentation seems like a good introduction to your problem.

Solution 3:

This isn't possible, in general, because strings under edit distance form a metric space, but not one with a fixed dimension. That means you can't provide a mapping between strings and integers that preserves the distances between them.

For example, consider these three words:

  • cat
  • dog
  • pig

Their pairwise edit distances are 3 (cat/dog), 3 (cat/pig), and 2 (dog/pig). For any three numbers on a line, the largest pairwise gap equals the sum of the other two; here 3 ≠ 3 + 2, so no assignment of numbers can reflect the differences between all three words at once.
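One way to check this concretely: three numbers on a line always have their largest pairwise gap equal to the sum of the other two, so any triple of strings whose edit distances violate that rule cannot be mapped to numbers. A quick check with the illustrative triple cat/dog/pig:

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def embeddable_on_line(d1, d2, d3):
    """Three pairwise distances fit on a number line iff the
    largest equals the sum of the other two."""
    a, b, c = sorted([d1, d2, d3])
    return c == a + b

dists = [levenshtein("cat", "dog"),   # 3
         levenshtein("cat", "pig"),   # 3
         levenshtein("dog", "pig")]   # 2
```

Since 3 ≠ 3 + 2, no three integers can reproduce these distances.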