Efficiently querying one string against multiple regexes

Let's say that I have 10,000 regexes and one string, and I want to find out whether the string matches any of them and get all the matches. The trivial way to do it would be to query the string one by one against all the regexes. Is there a faster, more efficient way to do it?

EDIT: I have tried substituting it with DFAs (lex). The problem here is that it will only give you a single pattern. If I have the string "hello" and the patterns "[H|h]ello" and ".{0,20}ello", the DFA will only match one of them, but I want both of them to hit.
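
For reference, the trivial way might look like this in Python (just a sketch using the two example patterns above; whether you want search, match, or fullmatch depends on the use case):

    import re

    patterns = [r"[H|h]ello", r".{0,20}ello"]    # the example patterns from above
    compiled = [re.compile(p) for p in patterns]

    def all_matching(text):
        # Query the string against every regex, one by one.
        return [p.pattern for p in compiled if p.search(text)]

    print(all_matching("hello"))   # both patterns hit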


We had to do this on a product I worked on once. The answer was to compile all your regexes together into a Deterministic Finite State Machine (also known as a deterministic finite automaton or DFA). The DFA could then be walked character by character over your string and would fire a "match" event whenever one of the expressions matched.

The advantages are that it runs fast (each character is compared only once) and that it does not get any slower as you add more expressions.

The disadvantages are that it requires a huge data table for the automaton, and that many kinds of regular expressions are not supported (for instance, back-references).
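
To make the idea concrete, here is a toy sketch in Python. The two literal patterns and the hand-built transition table are made up for illustration (a real engine generates the table from the actual regexes), but it shows the single character-by-character walk that fires a match event for every pattern whose accepting state is reached:

    # Toy combined DFA for the literal patterns "hell" (id 0) and "hello" (id 1).
    transitions = {                      # transitions[state][char] -> next state
        0: {"h": 1},
        1: {"e": 2},
        2: {"l": 3},
        3: {"l": 4},
        4: {"o": 5},
        5: {},
    }
    accepting = {4: [0], 5: [1]}         # state -> ids of patterns that accept there

    def scan(text):
        state = 0
        for i, ch in enumerate(text):
            state = transitions[state].get(ch)
            if state is None:
                break                    # no pattern can match any more
            for pattern_id in accepting.get(state, []):
                print(f"pattern {pattern_id} matched at index {i}")

    scan("hello")   # fires for pattern 0 (at index 3) and pattern 1 (at index 4)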

The one we used was hand-coded by a C++ template nut in our company at the time, so unfortunately I don't have any FOSS solutions to point you toward. But if you google "regex" or "regular expression" together with "DFA", you'll find material that will point you in the right direction.


This is the way lexers work.

The regular expressions are converted into a single nondeterministic finite automaton (NFA) and possibly transformed into a deterministic finite automaton (DFA).

The resulting automaton will try to match all the regular expressions at once and will succeed on one of them.

There are many tools that can help you here; they are called "lexer generators", and there are implementations that work with most languages.

You don't say which language you are using. For C programmers I would suggest having a look at the re2c tool. Of course, the traditional (f)lex is always an option.
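
If you just want to see that single-winner behaviour from Python, a rough approximation is to combine the rules into one big alternation with named groups. Python's re is a backtracking engine rather than a DFA, and the rule names here are made up, but the effect is similar: only one rule "wins" per match, which is exactly the limitation mentioned in the question's edit.

    import re

    rules = [
        ("RULE_A", r"[H|h]ello"),
        ("RULE_B", r".{0,20}ello"),
    ]
    # Each rule becomes a named alternative in one combined pattern.
    master = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in rules))

    m = master.match("hello")
    print(m.lastgroup)   # 'RULE_A' -- only one rule wins, as in a lexer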


I've come across a similar problem in the past. I used a solution similar to the one suggested by akdom.

I was lucky in that my regular expressions usually had some substring that must appear in every string they match. I was able to extract these substrings using a simple parser and index them in an FSA using the Aho-Corasick algorithm. The index was then used to quickly eliminate all the regular expressions that trivially don't match a given string, leaving only a few regular expressions to check.

I released the code under the LGPL as a Python/C module. See esmre on Google Code.
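
A rough sketch of that prefiltering idea (not the esmre code itself: here the required literals are written out by hand and the filter is a plain substring check, whereas esmre extracts the literals automatically and indexes them with Aho-Corasick):

    import re

    # (required literal, regex) pairs -- the literal must occur in any string the regex matches.
    rules = [
        ("ello",  r"[H|h]ello"),
        ("ello",  r".{0,20}ello"),
        ("world", r"wor[lL]d"),      # a made-up third rule, just to have something to filter out
    ]
    compiled = [(lit, re.compile(rx)) for lit, rx in rules]

    def matching_regexes(text):
        # Cheap filter first: drop every regex whose required literal is absent.
        survivors = [rx for lit, rx in compiled if lit in text]
        # Run only the (hopefully few) remaining regexes for real.
        return [rx.pattern for rx in survivors if rx.search(text)]

    print(matching_regexes("hello"))   # ['[H|h]ello', '.{0,20}ello']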


Martin Sulzmann has done quite a bit of work in this field. He has a HackageDB project, explained briefly here, which uses partial derivatives and seems to be tailor-made for this.

The language used is Haskell, so it will be very hard to translate to a non-functional language if that is the desire (I would think translation to many other FP languages would still be quite hard).

The code is not based on converting the regexes to a series of automata and then combining them; instead, it is based on symbolic manipulation of the regexes themselves.
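
To give a flavour of the symbolic approach, here is a rough Python sketch of the classic Brzozowski-derivative idea over a tiny regex subset (this is not Sulzmann's partial-derivative Haskell code, and the two example patterns are built by hand): you repeatedly take the derivative of every regex with respect to the next input character, and at the end you report every pattern whose residual regex accepts the empty string, which naturally yields all matching patterns at once.

    from dataclasses import dataclass

    class Rx: pass                       # base class for a tiny regex AST

    @dataclass(frozen=True)
    class Empty(Rx): pass                # matches nothing

    @dataclass(frozen=True)
    class Eps(Rx): pass                  # matches the empty string

    @dataclass(frozen=True)
    class Chr(Rx):                       # matches one specific character
        c: str

    @dataclass(frozen=True)
    class Alt(Rx):                       # a | b
        a: Rx
        b: Rx

    @dataclass(frozen=True)
    class Cat(Rx):                       # a followed by b
        a: Rx
        b: Rx

    @dataclass(frozen=True)
    class Star(Rx):                      # a*
        a: Rx

    def nullable(r):
        # True if r accepts the empty string.
        if isinstance(r, (Eps, Star)):
            return True
        if isinstance(r, Alt):
            return nullable(r.a) or nullable(r.b)
        if isinstance(r, Cat):
            return nullable(r.a) and nullable(r.b)
        return False                     # Empty, Chr

    def deriv(r, c):
        # Brzozowski derivative: the strings s such that c + s is in the language of r.
        if isinstance(r, (Empty, Eps)):
            return Empty()
        if isinstance(r, Chr):
            return Eps() if r.c == c else Empty()
        if isinstance(r, Alt):
            return Alt(deriv(r.a, c), deriv(r.b, c))
        if isinstance(r, Star):
            return Cat(deriv(r.a, c), r)
        # Cat: derive the head; if the head can be empty, the tail may also start here.
        d = Cat(deriv(r.a, c), r.b)
        return Alt(d, deriv(r.b, c)) if nullable(r.a) else d

    def matching_patterns(patterns, text):
        # Walk every named pattern over the string in lockstep and report all that accept it.
        residuals = dict(patterns)
        for ch in text:
            residuals = {name: deriv(r, ch) for name, r in residuals.items()}
        return [name for name, r in residuals.items() if nullable(r)]

    def lit(s):
        # Helper: a regex matching exactly the literal string s.
        r = Eps()
        for ch in s:
            r = Cat(r, Chr(ch))
        return r

    p1 = Cat(Alt(Chr("H"), Chr("h")), lit("ello"))        # roughly "[Hh]ello"
    p2 = Cat(Star(Alt(Chr("h"), Chr("H"))), lit("ello"))  # roughly "[hH]*ello"
    print(matching_patterns([("p1", p1), ("p2", p2)], "hello"))   # ['p1', 'p2']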

Also, the code is very much experimental and Martin is no longer a professor but is in 'gainful employment'(1), so he may be uninterested in or unable to supply any help or input.


  1. This is a joke - I like professors; the less the smart ones try to work, the more chance I have of getting paid!