Why can regular expressions have an exponential running time?

It is possible to write a regex that, for some inputs, needs exponential running time. One such example is (aa|aa)*. If the input consists of an odd number of a's, matching it takes exponential time.

This is easy to test. If the input contains only a's and has length 51, the regex takes several seconds to compute (on my machine). If the input length is 52 instead, the computing time is not noticeable (I tested this with the built-in regex engine of the Java Runtime Environment).
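The same behavior can be reproduced with Python's built-in `re` module, which is also a backtracking engine (the input lengths here are kept small so the demonstration finishes quickly):

```python
import re
import time

pattern = re.compile(r'(aa|aa)*')

# An even number of a's matches immediately.
assert pattern.fullmatch('a' * 30) is not None

# An odd number of a's can never match, but a backtracking engine tries
# every way of distributing the pairs between the two identical
# alternatives before giving up -- roughly 2^(n/2) attempts.
for n in (29, 31, 33, 35):
    t0 = time.perf_counter()
    assert pattern.fullmatch('a' * n) is None
    print(n, time.perf_counter() - t0)  # the time grows roughly exponentially
```

With lengths around 51, as in the question, the rejection already takes seconds on a typical machine.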

I have written a regex parser to find the reason for this behavior, but I didn't find it. My parser can build an AST or an NFA from a regex. After that it can translate the NFA to a DFA, using the powerset construction algorithm.

When I parse the regex mentioned above, the parser creates an NFA with 7 states; after the conversion only 3 states are left in the DFA. The DFA represents the more sensible regex (aa)*, which can be matched very fast.
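A minimal sketch of that pipeline (the state numbering and the NFA encoding are my own, not taken from the question's parser) shows the same collapse:

```python
from collections import deque

def epsilon_closure(states, eps):
    """All NFA states reachable from `states` via epsilon moves alone."""
    stack, closure = list(states), set(states)
    while stack:
        state = stack.pop()
        for target in eps.get(state, ()):
            if target not in closure:
                closure.add(target)
                stack.append(target)
    return frozenset(closure)

def subset_construction(start, accept, delta, eps, alphabet):
    """Powerset construction: delta maps (state, symbol) -> set of states,
    eps maps state -> set of epsilon successors."""
    start_set = epsilon_closure({start}, eps)
    table, accepts = {}, set()
    worklist, seen = deque([start_set]), {start_set}
    while worklist:
        current = worklist.popleft()
        if current & accept:
            accepts.add(current)
        for symbol in alphabet:
            step = set()
            for state in current:
                step |= delta.get((state, symbol), set())
            step = epsilon_closure(step, eps)
            table[(current, symbol)] = step
            if step not in seen:
                seen.add(step)
                worklist.append(step)
    return start_set, accepts, table

# A hand-built NFA for (aa|aa)* with both (identical) alternatives kept
# separate: 0 is the start/accept state, states 1-2 and 3-4 are the two
# 'aa' branches, and 5 loops back to 0 via an epsilon move.
delta = {(1, 'a'): {2}, (2, 'a'): {5}, (3, 'a'): {4}, (4, 'a'): {5}}
eps = {0: {1, 3}, 5: {0}}
start, accepts, table = subset_construction(0, {0}, delta, eps, 'a')

def dfa_accepts(s):
    state = start
    for ch in s:
        state = table[(state, ch)]
    return state in accepts
```

The construction merges the two identical branches into the same state sets, leaving a 3-state DFA that just alternates between accepting and non-accepting states, i.e. the automaton for (aa)*.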

Thus, I don't understand why there are parsers that can be so slow. What is the reason for this? Do they not translate the NFA to a DFA? If not, why not? And what are the technical reasons why they compute so slowly?


Solution 1:

Russ Cox has a very detailed article about why this is so and about the history of regexes (part 2, part 3).

Regular expression matching can be simple and fast, using finite automata-based techniques that have been known for decades. In contrast, Perl, PCRE, Python, Ruby, Java, and many other languages have regular expression implementations based on recursive backtracking that are simple but can be excruciatingly slow. With the exception of backreferences, the features provided by the slow backtracking implementations can be provided by the automata-based implementations at dramatically faster, more consistent speeds.

Largely, it comes down to the proliferation of non-regular features in "regular" expressions, such as backreferences, and the continued ignorance of most programmers that there are better alternatives for regexes that do not contain such features (which covers many of them).

While writing the text editor sam in the early 1980s, Rob Pike wrote a new regular expression implementation, which Dave Presotto extracted into a library that appeared in the Eighth Edition. Pike's implementation incorporated submatch tracking into an efficient NFA simulation but, like the rest of the Eighth Edition source, was not widely distributed. Pike himself did not realize that his technique was anything new. Henry Spencer reimplemented the Eighth Edition library interface from scratch, but using backtracking, and released his implementation into the public domain. It became very widely used, eventually serving as the basis for the slow regular expression implementations mentioned earlier: Perl, PCRE, Python, and so on. (In his defense, Spencer knew the routines could be slow, and he didn't know that a more efficient algorithm existed. He even warned in the documentation, “Many users have found the speed perfectly adequate, although replacing the insides of egrep with this code would be a mistake.”) Pike's regular expression implementation, extended to support Unicode, was made freely available with sam in late 1992, but the particularly efficient regular expression search algorithm went unnoticed.

Solution 2:

Regular expressions conforming to the formal definition are computable in linear time, because they have corresponding finite automata. They are built only from parentheses, alternation | (sometimes called union), the Kleene star *, and concatenation.
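To make the linear-time claim concrete, here is a small sketch (the state numbering is my own) of the standard set-of-states NFA simulation, which never backtracks:

```python
def simulate_nfa(s, start, accept, delta, eps):
    """Track the set of reachable NFA states (Thompson's technique).
    Each input symbol is processed once, so the running time is
    O(len(s) * number_of_states) -- linear in the input length."""
    def closure(states):
        stack, out = list(states), set(states)
        while stack:
            q = stack.pop()
            for t in eps.get(q, ()):
                if t not in out:
                    out.add(t)
                    stack.append(t)
        return out

    current = closure({start})
    for ch in s:
        current = closure({t for q in current for t in delta.get((q, ch), ())})
    return bool(current & accept)

# Hand-built NFA for (aa|aa)*: 0 is the start/accept state, 1-2 and 3-4
# are the two identical 'aa' branches, 5 loops back to 0 via epsilon.
delta = {(1, 'a'): {2}, (2, 'a'): {5}, (3, 'a'): {4}, (4, 'a'): {5}}
eps = {0: {1, 3}, 5: {0}}

# Odd and even inputs are both decided in time proportional to their length.
assert simulate_nfa('a' * 52, 0, {0}, delta, eps)
assert not simulate_nfa('a' * 51, 0, {0}, delta, eps)
```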

Extending regular expressions with, for example, backreferences makes the matching problem NP-complete. Here you can find an example of a regular expression recognizing non-prime numbers.
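The example usually cited (commonly attributed to Abigail; treat that attribution as my assumption) tests primality of a number written in unary:

```python
import re

# A string of n '1's matches iff n is NOT prime: the group (11+?)
# guesses a divisor d >= 2, and \1+ checks that n is a multiple of it.
# The empty alternative ^1?$ covers the non-primes 0 and 1.
NONPRIME = re.compile(r'^1?$|^(11+?)\1+$')

def is_prime(n):
    return NONPRIME.match('1' * n) is None

assert [n for n in range(2, 20) if is_prime(n)] == [2, 3, 5, 7, 11, 13, 17, 19]
```

No finite automaton can recognize this language (unary non-primes are not regular), so it is the backreference extension, not classical regular expressions, doing the work here.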

I suspect that such an extended implementation can have non-linear matching time even in simple cases.

I made a quick experiment in Perl, and your regular expression computes equally fast for an odd and an even number of a's.