What algorithm gives suggestions in a spell checker?
There is good essay by Peter Norvig how to implement a spelling corrector. It's basicly a brute force approach trying candidate strings with a given edit distance. (Here are some tips how you can improve the spelling corrector performance using a Bloom Filter and faster candidate hashing.)
The requirements for a spell checker are weaker. You have only to find out that a word is not in the dictionary. You can use a Bloom Filter to build a spell checker which consumes less memory. An ancient versions is decribed in Programming Pearls by Jon Bentley using 64kb for an english dictionary.
A BK-Tree is an alternative approach. A nice article is here.
Levenshstein distance is not exactly the right edit distance for a spell checker. It knows only insertion, deletion and substitution. Transposition is missing and produces 2 for a transposition of 1 character (it's 1 delete and 1 insertion). Damerau–Levenshtein distance is the right edit distance.
An approach for generating suggestions that I've used successfully but never seen described anywhere is to pre-compute suggestions (when building the dictionary) by using "bad" hash functions.
The idea is to look at the types of spelling errors people make, and to design hash functions that would assign an incorrect spelling to the same bucket as its correct spelling.
For example, a common mistake is to use the wrong vowel, like definate instead of definite. So you design a hash function that treats all vowels as the same letter. An easy way to do that is to first "normalize" the input word and then put the normalized result through a regular hash function. In this example, the normalizing function might drop all the vowels, so definite
becomes dfnt
. The "normalized" word is then hashed with a typical hash function.
Insert all of your dictionary words into an auxiliary index (hash table) using this special hash function. The buckets in this table will have longish collision lists because the hash function is "bad", but those collision lists are essentially pre-computed suggestions.
Now, when you find a misspelled word, you look up the collision lists for the bucket that the misspelling maps to in the auxiliary indexes. Ta da: You have a suggestion list! All you have to do is rank the words on it.
In practice, you'll need a few auxiliary indexes with other hash functions to handle other types of errors, like transposed letters, single/double letter, and even a simplistic Soundex-like one to catch phonetic misspellings. In practice, I found simplistic pronunciation ones to go a long way and essentially obsolete some of the ones designed to find trivial typos.
So now you look up misspellings in each of the auxiliary indexes and concatenate the collision lists before ranking.
Remember the collision lists contain only words that are in the dictionary. With approaches that try to generate alternate spellings (as in the Peter Norvig article), you can get (tens of) thousands of candidates that you first have to filter against the dictionary. With the pre-computed approach, you get maybe a couple hundred candidates, and you know that they're all correctly spelled, so you can skip straight to ranking.
Update: I've since found one algorithm description that's similar to this, the FAROO Distributed Search. This is still an edit-distance limited search, but it's very fast because the pre-calculation step works like my "bad hash functions" idea. FAROO just uses a limited concept of a bad hash function.
Algorithm
- Take a wrongly spelled word as input.
- Store the list of English words along with their frequencies in a text file.
- Insert all the available English words (stored in the text file) along with their frequencies (measure of how frequently a word is used in English language) in a Ternary Search Tree.
- Now traverse along the Ternary Search Tree -
- For each word encountered in the Ternary Search Tree, calculate its Levensthein Distance from the wrongly spelled word.
- If Levensthein Distance <= 3, store the word in a Priority Queue.
- If two words have same edit distance, the one with higher frequency is grater. Print the top 10 items from Priority Queue.
Optimization
- You can eleminate the words in subtree of current node if the edit distance of substring of input word from the current word is grater than the 3.
You can find the more detailed explanation and source code on github project.