The fastest C++ algorithm for string testing against a list of predefined seeds (case insensitive)

If there are a finite amount of matching strings, this means that you can construct a tree such that, read from root to leaves, similar strings will occupy similar branches.

This is also known as a trie, or Radix Tree.

For example, we might have the strings cat, coach, con, conch as well as dark, dad, dank, do. Their trie might look like this:

enter image description here

A search for one of the words in the tree will search the tree, starting from a root. Making it to a leaf would correspond to a match to a seed. Regardless, each character in the string should match to one of their children. If it does not, you can terminate the search (e.g. you would not consider any words starting with "g" or any words beginning with "cu").

There are various algorithms for constructing the tree as well as searching it as well as modifying it on the fly, but I thought I would give a conceptual overview of the solution instead of a specific one since I don't know of the best algorithm for it.

Conceptually, an algorithm you might use to search the tree would be related to the idea behind radix sort of a fixed amount of categories or values that a character in a string might take on at a given point in time.

This lets you check one word against your word-list. Since you're looking for this word-list as sub-strings of your input string, there's going to be more to it than this.

Edit: As other answers have mentioned, the Aho-Corasick algorithm for string matching is a sophisticated algorithm for performing string matching, consisting of a trie with additional links for taking "shortcuts" through the tree and having a different search pattern to accompany this. (As an interesting note, Alfred Aho is also a contributor to the the popular compiler textbook, Compilers: Principles, Techniques, and Tools as well as the algorithms textbook, The Design And Analysis Of Computer Algorithms. He is also a former member of Bell Labs. Margaret J. Corasick does not seem to have too much public information on herself.)


You can use Aho–Corasick algorithm

It builds trie/automaton where some vertices marked as terminal which would mean string has seeds.

It's built in O(sum of dictionary word lengths) and gives the answer in O(test string length)

Advantages:

  • It's specifically works with several dictionary words and check time doesn't depend on number of words (If we not consider cases where it doesn't fit to memory etc)
  • The algorithm is not hard to implement (comparing to suffix structures at least)

You may make it case insensitive by lowering each symbol if it's ASCII (non ASCII chars don't match anyway)