Algorithm to find multiple string matches

I'm looking for suggestions for an efficient algorithm for finding all matches in a large body of text. Terms to search for will be contained in a list and can have 1000+ possibilities. The search terms may be 1 or more words.

Obviously I could make multiple passes through the text comparing against each search term. Not too efficient.

I've thought about ordering the search terms and combining common sub-segments. That way I could eliminate large numbers of terms quickly. Language is C++ and I can use boost.

An example of search terms could be a list of Fortune 500 company names.

Ideas?


Solution 1:

Don´t reinvent the wheel

This problem has been intensively researched. Curiously, the best algorithms for searching ONE pattern/string do not extrapolate easily to multi-string matching.

The "grep" family implement the multi-string search in a very efficient way. If you can use them as external programs, do it.

In case you really need to implement the algorithm, I think the fastest way is to reproduce what agrep does (agrep excels in multi-string matching!). Here are the source and executable files.

And here you will find a paper describing the algorithms used, the theoretical background, and a lot of information and pointers about string matching.

A note of caution: multiple-string matching have been heavily researched by people like Knuth, Boyer, Moore, Baeza-Yates, and others. If you need a really fast algorithm don't hesitate on standing on their broad shoulders. Don't reinvent the wheel.

Solution 2:

As in the case of single patterns, there are several algorithms for multiple-pattern matching, and you will have to find the one that fits your purpose best. The paper A fast algorithm for multi-pattern searching (archived copy) does a review of most of them, including Aho-Corasick (which is kind of the multi-pattern version of the Knuth-Morris-Pratt algorithm, with linear complexity) and Commentz-Walter (a combination of Boyer-Moore and Aho-Corasick), and introduces a new one, which uses ideas from Boyer-Moore for the task of matching multiple patterns.

An alternative, hash-based algorithm not mentioned in that paper, is the Rabin-Karp algorithm, which has a worst-case complexity bigger than other algorithms, but compensates it by reducing the linear factor via hashing. Which one is better depends ultimately on your use case. You may need to implement several of them and compare them in your application if you want to choose the fastest one.