Good algorithm and data structure for looking up words with missing letters?

I need to write an efficient algorithm for looking up words with missing letters in a dictionary and I want the set of possible words.

For example, if I have th??e, I might get back "these", "those", "theme", "there", etc.

There will be up to TWO question marks and when two question marks do occur, they will occur in sequence.

I was wondering if anyone can suggest some data structures or algorithms I should use.

A Trie is too space-inefficient and would make it too slow. Any other ideas or modifications?

Currently I am using 3 hash tables: one for exact matches, one for 1 question mark, and one for 2 question marks. Given a dictionary, I hash all the possible words. For example, if I have the word WORD, I hash WORD, ?ORD, W?RD, WO?D, WOR?, ??RD, W??D, and WO?? into the dictionary. Then I use a linked list to chain the collisions together. So say hash(W?RD) = hash(STR?NG) = 17. hashtab(17) will point to WORD, and WORD points to STRING because it is a linked list.
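As a rough illustration of the key generation described above, here is a minimal Ruby sketch. The helper name `wildcard_keys` is made up for this example, and it assumes at most two question marks that are always adjacent, as stated in the question.

```ruby
# Hypothetical helper: list every lookup key stored for one dictionary word,
# assuming at most two "?" and that two "?" always sit next to each other.
def wildcard_keys(word)
  keys = [word]                    # exact match
  word.length.times do |i|         # keys with one "?"
    key = word.dup
    key[i] = "?"
    keys << key
  end
  (word.length - 1).times do |i|   # keys with two adjacent "?"
    key = word.dup
    key[i, 2] = "??"
    keys << key
  end
  keys
end

p wildcard_keys("WORD")
```

For a word of length n this produces 1 + n + (n - 1) keys, which is where the memory cost of the approach comes from.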

The average lookup time for one word is about 2e-6 s. I am looking to do better, preferably on the order of 1e-9 s. Inserting 3 million entries took 0.5 seconds, and looking up 3 million entries took 4 seconds.


Solution 1:

I believe in this case it is best to just use a flat file where each word is on its own line. With this you can conveniently use the power of a regular expression search, which is highly optimized and will probably beat any data structure you can devise yourself for this problem.

Solution #1: Using Regex

This is working Ruby code for this problem:

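def query(str, data)
  # Turn each "?" into the regex wildcard "."; ^ and $ anchor the match to one whole line.
  r = Regexp.new("^#{str.gsub("?", ".")}$")
  idx = 0
  begin
    # Find the next matching line at or after idx.
    idx = data.index(r, idx)
    if idx
      yield data[idx, str.size]
      # Skip past the matched word and its trailing newline.
      idx += str.size + 1
    end
  end while idx
end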

start_time = Time.now
query("?r?te", File.read("wordlist.txt")) do |w|
  puts w
end
puts Time.now - start_time

The file wordlist.txt contains 45425 words (downloadable here). The program's output for query ?r?te is:

brute
crate
Crete
grate
irate
prate
write
wrote
0.013689

So it takes just about 14 milliseconds to both read the whole file and to find all matches in it. And it scales very well for all kinds of query patterns, even where a Trie is very slow:

query ????????????????e

counterproductive
indistinguishable
microarchitecture
microprogrammable
0.018681

query ?h?a?r?c?l?

theatricals
0.013608

This looks fast enough for me.

Solution #2: Regex with Prepared Data

If you want to go even faster, you can split the wordlist into strings that contain words of equal lengths and just search the correct one based on your query length. Replace the last 5 lines with this code:

def query_split(str, data)
  query(str, data[str.length]) do |w|
    yield w
  end
end

# prepare data    
data = Hash.new("")
File.read("wordlist.txt").each_line do |w|
  data[w.length-1] += w
end

# use prepared data for query
start_time = Time.now
query_split("?r?te", data) do |w|
  puts w
end
puts Time.now - start_time

Building the data structure now takes about 0.4 seconds, but all queries are about 10 times faster (depending on the number of words with that length):

  • ?r?te 0.001112 sec
  • ?h?a?r?c?l? 0.000852 sec
  • ????????????????e 0.000169 sec

Solution #3: One Big Hashtable (Updated Requirements)

Since you have changed your requirements, you can easily expand on your idea to use just one big hashtable that contains all precalculated results. But instead of working around collisions yourself you could rely on the performance of a properly implemented hashtable.

Here I create one big hashtable, where each possible query maps to a list of its results:

def create_big_hash(data)
  h = Hash.new do |h,k|
    h[k] = Array.new
  end    
  data.each_line do |l|
    w = l.strip
    # add all words with one ?
    w.length.times do |i|
      q = String.new(w)
      q[i] = "?"
      h[q].push w
    end
    # add all words with two ??
    (w.length-1).times do |i|
      q = String.new(w)      
      q[i, 2] = "??"
      h[q].push w
    end
  end
  h
end

# prepare data    
t = Time.new
h = create_big_hash(File.read("wordlist.txt"))
puts "#{Time.new - t} sec preparing data\n#{h.size} entries in big hash"

# use prepared data for query
t = Time.new
h["?ood"].each do |w|
  puts w
end
puts (Time.new - t)

Output is

4.960255 sec preparing data
616745 entries in big hash
food
good
hood
mood
wood
2.0e-05

The query performance is O(1): it is just a lookup in the hashtable. The time 2.0e-05 is probably below the timer's precision. When running it 1000 times, I get an average of 1.958e-6 seconds per query. To make it faster, I would switch to C++ and use Google Sparse Hash, which is extremely memory efficient and fast.
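Averaging over many runs, as mentioned above, could be sketched like this. The helper name `average_lookup_seconds` and the tiny sample table are made up for illustration; the sketch uses the monotonic clock from Ruby's core `Process` module, which is not necessarily how the timing above was taken.

```ruby
# Hypothetical helper: average wall-clock time of one hash lookup over many runs.
def average_lookup_seconds(table, key, runs = 1000)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  runs.times { table[key] }
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  elapsed / runs
end

# Tiny stand-in for the big hash built by create_big_hash.
sample = { "?ood" => %w[food good hood mood wood] }
puts average_lookup_seconds(sample, "?ood")
```

The printed number depends on the machine, so no expected value is given here.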

Solution #4: Get Really Serious

All above solutions work and should be good enough for many use cases. If you really want to get serious and have lots of spare time on your hands, read some good papers:

  • Tries for Approximate String Matching - If well implemented, tries can have very compact memory requirements (50% less space than the dictionary itself), and are very fast.
  • Agrep - A Fast Approximate Pattern-Matching Tool - Agrep is based on a new efficient and flexible algorithm for approximate string matching.
  • Google Scholar search for approximate string matching - More than enough to read on this topic.

Solution 2:

Given the current limitations:

  • There will be up to 2 question marks
  • When there are 2 question marks, they appear together
  • There are ~100,000 words in the dictionary, average word length is 6.

I have two viable solutions for you:

The fast solution: HASH

You can use a hash whose keys are your words with up to two '?', and whose values are lists of fitting words. This hash will have around 100,000 + 100,000*6 + 100,000*5 = 1,200,000 entries (if you have 2 question marks, you just need to find the place of the first one...). Each entry can store a list of words, or a list of pointers to the existing words. If you store a list of pointers, and we assume that there are on average fewer than 20 words matching each word with two '?', then the additional memory is less than 20 * 1,200,000 = 24,000,000 pointers.

If each pointer is 4 bytes, then the memory requirement here is (24,000,000+1,200,000)*4 bytes = 100,800,000 bytes ~= 96 megabytes.
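The estimate above can be reproduced with a few lines of Ruby. This is only a restatement of the arithmetic; the method name `hash_memory_estimate` is made up, and the defaults are the figures assumed in this answer (100,000 words, average length 6, at most 20 matches per key, 4-byte pointers).

```ruby
# Hypothetical helper that redoes the back-of-the-envelope estimate.
def hash_memory_estimate(words: 100_000, avg_length: 6, matches_per_key: 20, pointer_bytes: 4)
  # Keys per word: 1 exact + avg_length with one "?" + (avg_length - 1) with "??".
  keys_per_word = 1 + avg_length + (avg_length - 1)
  entries  = words * keys_per_word
  pointers = matches_per_key * entries
  bytes    = (pointers + entries) * pointer_bytes
  { entries: entries, pointers: pointers, bytes: bytes }
end

estimate = hash_memory_estimate
puts estimate[:entries]
puts estimate[:bytes]
puts((estimate[:bytes] / 2.0**20).round(1)) # size in units of 2^20 bytes
```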

To sum up this solution:

  • Memory Consumption: ~96 MB
  • Time for each search: calculating a hash function, and following a pointer. O(1)

Note: if you want to use a hash of a smaller size, you can, but then it is better to save a balanced search tree in each entry instead of a linked list, for better performance.

The space savvy, but still very fast solution: TRIE variation

This solution uses the following observation:

If the '?' signs were at the end of the word, trie would be an excellent solution.

The search would walk the trie along the known letters of the word, and for the last couple of letters a DFS traversal would collect all of the endings. This is a very fast and very memory-savvy solution.

So let's use this observation to build something that works exactly like this.

You can think about every word you have in the dictionary as a word ending with @ (or any other symbol that does not exist in your dictionary). So the word 'space' would be 'space@'. Now, if you rotate each of the words, with the '@' sign, you get the following:

space@, pace@s, ace@sp, *ce@spa*, e@spac

(no @ as first letter).

If you insert all of these variations into a TRIE, you can easily find the word you are seeking in time proportional to the length of the word, by 'rotating' your word.

Example: You want to find all words that fit 's??ce' (one of them is space, another is slice). You build the word s??ce@ and rotate it so that the ? signs are at the end, i.e. 'ce@s??'.

All of the rotation variations exist inside the trie, and specifically 'ce@spa' (marked with * above). After the beginning is found, you need to go over all of the continuations of the appropriate length and save them. Then you need to rotate them again so that the @ is the last letter, and voilà, you have all of the words you were looking for!
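A minimal Ruby sketch of this rotation idea follows, using nested hashes as the trie. The names `build_rotation_trie` and `rotation_query` are made up for illustration. Two details are assumptions that go beyond the description above: the rotation starting with '@' is also stored, so that patterns ending in '?' can be answered, and each complete rotation is flagged with an `:end` marker so that a prefix of a longer word's rotation is not reported as a match.

```ruby
# Build a trie (nested hashes) holding every rotation of word + "@".
# Assumption: the rotation that starts with "@" is stored as well.
def build_rotation_trie(words)
  root = {}
  words.each do |word|
    marked = word + "@"
    marked.length.times do |i|
      rotation = marked[i..-1] + marked[0...i]
      node = root
      rotation.each_char { |c| node = (node[c] ||= {}) }
      node[:end] = true # a complete rotation ends at this node
    end
  end
  root
end

# Answer a pattern whose "?" characters (if any) form one contiguous run.
def rotation_query(trie, pattern)
  marked = pattern + "@"
  wildcards = marked.count("?")
  first = marked.index("?") || marked.length
  # Rotate so the known letters come first; the "?" run would follow them.
  known = marked[(first + wildcards)..-1] + marked[0...first]

  # Walk the known part of the rotated pattern.
  node = trie
  known.each_char do |c|
    node = node[c]
    return [] if node.nil?
  end

  # Depth-first search for endings that are exactly `wildcards` letters long.
  endings = []
  walk = lambda do |current, ending|
    if ending.length == wildcards
      endings << ending if current[:end]
    else
      current.each do |char, child|
        walk.call(child, ending + char) unless char == :end
      end
    end
  end
  walk.call(node, "")

  # Rotate each hit back so that "@" would be last, then drop the "@".
  endings.map do |ending|
    tail, _, head = (known + ending).partition("@")
    head + tail
  end.sort
end

trie = build_rotation_trie(%w[space slice spice spaces place])
p rotation_query(trie, "s??ce")
p rotation_query(trie, "spa??")
```

Nested hashes keep the sketch short but are far from compact, so this says nothing about the memory figures below.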

To sum up this solution:

  • Memory Consumption: For each word, all of its rotations appear in the trie. On average, that stores 6 times the dictionary size in the trie. The trie itself takes around 3 times (just guessing...) the space of what is stored inside it. So the total space necessary for this trie is 6*3*100,000 = 1,800,000 words ~= 6.8 megabytes.

  • Time for each search:

    • rotating the word: O(word length)
    • seeking the beginning in the trie: O(word length)
    • going over all of the endings: O(number of matches)
    • rotating the endings: O(total length of answers)

    To sum up, it is very fast, and the time depends on the word length times a small constant.

To sum up...

The second choice has a great time/space complexity, and would be the best option for you to use. There are a few problems with the second solution (in which case you might want to use the first solution):

  • More complex to implement. I'm not sure whether there are programming languages with tries built in out of the box. If there aren't, you'll need to implement it yourself...
  • Does not scale well. If tomorrow you decide that you need your question marks spread all over the word, and not necessarily joined together, you'll need to think hard about how to adapt the second solution to it. In the case of the first solution, it is quite easy to generalize.

Solution 3:

To me this problem sounds like a good fit for a Trie data structure. Enter the entire dictionary into your trie, and then look up the word. For a missing letter you would have to try all sub-tries, which should be relatively easy to do with a recursive approach.

EDIT: I wrote a simple implementation of this in Ruby just now: http://gist.github.com/262667.
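For readers who cannot open the link, here is an independent minimal sketch of the recursive idea. It is not the code from the gist; the class name `WildcardTrie` and its methods are made up for illustration.

```ruby
# Hypothetical trie over nested hashes; :end marks the end of a stored word.
class WildcardTrie
  def initialize
    @root = {}
  end

  def insert(word)
    node = @root
    word.each_char { |c| node = (node[c] ||= {}) }
    node[:end] = true
  end

  # Return all stored words matching the pattern, where "?" matches any one letter.
  def match(pattern, node = @root, prefix = "")
    return(node[:end] ? [prefix] : []) if pattern.empty?

    head = pattern[0]
    rest = pattern[1..-1]
    if head == "?"
      # Missing letter: try every sub-trie.
      node.flat_map do |char, child|
        char == :end ? [] : match(rest, child, prefix + char)
      end
    else
      child = node[head]
      child ? match(rest, child, prefix + head) : []
    end
  end
end

trie = WildcardTrie.new
%w[food good hood mood wood word].each { |w| trie.insert(w) }
p trie.match("?ood").sort
p trie.match("wo??").sort
```

A '?' near the start of the pattern makes the search branch into every sub-trie at that level, which is the cost the other answers try to avoid.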

Solution 4:

A Directed Acyclic Word Graph (DAWG) would be a perfect data structure for this problem. It combines the efficiency of a trie (a trie can be seen as a special case of a DAWG) with much better space efficiency. A typical DAWG takes a fraction of the size that a plain text file with the words would take.

Enumerating words that meet specific conditions is simple and the same as in a trie: you traverse the graph in depth-first fashion.
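To give a feel for where the space saving comes from, here is a minimal Ruby sketch. It builds a plain trie first and then merges identical subtrees bottom-up, which is one simple way to arrive at a DAWG-like graph but not an incremental DAWG construction algorithm. All names (`build_trie`, `merge_suffixes`, `count_nodes`, `match`) are made up for illustration.

```ruby
require "set"

# Plain trie over nested hashes; :end marks the end of a word.
def build_trie(words)
  root = {}
  words.each do |word|
    node = root
    word.each_char { |c| node = (node[c] ||= {}) }
    node[:end] = true
  end
  root
end

# Merge identical subtrees bottom-up so that shared suffixes are stored once.
def merge_suffixes(node, registry = {})
  node.keys.each do |key|
    next if key == :end
    node[key] = merge_suffixes(node[key], registry)
  end
  # Two nodes are equivalent if they have the same labels leading to the same children.
  signature = node.map { |key, child| [key.to_s, key == :end ? 0 : child.object_id] }.sort
  registry[signature] ||= node
end

# Count distinct nodes reachable from the given node.
def count_nodes(node, seen = Set.new)
  return 0 unless seen.add?(node.object_id)
  1 + node.sum { |key, child| key == :end ? 0 : count_nodes(child, seen) }
end

# Depth-first wildcard search; works the same on the trie and on the merged graph.
def match(node, pattern, prefix = "")
  return(node[:end] ? [prefix] : []) if pattern.empty?
  head = pattern[0]
  rest = pattern[1..-1]
  if head == "?"
    node.flat_map { |key, child| key == :end ? [] : match(child, rest, prefix + key) }
  else
    node[head] ? match(node[head], rest, prefix + head) : []
  end
end

words = %w[food good hood mood wood]
trie = build_trie(words)
puts count_nodes(trie)   # nodes before merging
dawg = merge_suffixes(trie)
puts count_nodes(dawg)   # nodes after merging
p match(dawg, "?ood").sort
```

The word list here is chosen so that all words share the suffix "ood"; on a real dictionary the saving depends on how many suffixes are shared.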