Writing a parser for regular expressions

Solution 1:

Writing an implementation of a regular expression engine is indeed a quite complex task.

But if you are interested in how to do it, even if you can't understand enough of the details to actually implement it, I would recommend that you at least look at this article:

Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...)

It explains how many of the popular programming languages implement regular expressions in a way that can be very slow for some regular expressions, and explains a slightly different method that is faster. The article includes some details of how the proposed implementation works, including some source code in C. It may be a bit heavy reading if you are just starting to learn regular expressions, but I think it is well worth knowing about the difference between the two approaches.

Solution 2:

I've already given a +1 to Mark Byers - but as far as I remember the paper doesn't really say that much about how regular expression matching works beyond explaining why one algorithm is bad and another much better. Maybe something in the links?

I'll focus on the good approach - creating finite automata. If you limit yourself to deterministic automata with no minimisation, this isn't really too difficult.

What I'll (very quickly) describe is the approach taken in Modern Compiler Design.

Imagine you have the following regular expression...

a (b c)* d

The letters represent literal characters to match. The * is the usual zero-or-more repetitions match. The basic idea is to derive states based on dotted rules. State zero we'll take as the state where nothing has been matched yet, so the dot goes at the front...

0 : .a (b c)* d

The only possible match is 'a', so the next state we derive is...

1 : a.(b c)* d

We now have two possibilities - match the 'b' (if there's at least one repeat of 'b c') or match the 'd' otherwise. Note - we are basically doing a digraph search here (either depth first or breadth first or whatever) but we are discovering the digraph as we search it. Assuming a breadth-first strategy, we'll need to queue one of our cases for later consideration, but I'll ignore that issue from here on. Anyway, we've discovered two new states...

2 : a (b.c)* d
3 : a (b c)* d.

State 3 is an end state (there may be more than one). For state 2, we can only match the 'c', but we need to be careful with the dot position afterwards. We get "a.(b c)* d" - which is the same as state 1, so we don't need a new state.

IIRC, the approach in Modern Compiler Design is to translate a rule when you hit an operator, in order to simplify the handling of the dot. State 1 would be transformed to...

1 : a.b c (b c)* d
    a.d

That is, your next option is either to match the first repetition or to skip the repetition. The next states from this are equivalent to states 2 and 3. An advantage of this approach is that you can discard all your past matches (everything before the '.') as you only care about future matches. This typically gives a smaller state model (but not necessarily a minimal one).

EDIT If you do discard already matched details, your state description is a representation of the set of strings that can occur from this point on.

In terms of abstract algebra, this is a kind of set closure. An algebra is basically a set with one (or more) operators. Our set is of state descriptions, and our operators are our transitions (character matches). A closed set is one where applying any operator to any members in the set always produces another member that is in the set. The closure of a set is the mimimal larger set that is closed. So basically, starting with the obvious start state, we are constructing the minimal set of states that is closed relative to our set of transition operators - the minimal set of reachable states.

Minimal here refers to the closure process - there may be a smaller equivalent automata which is normally referred to as minimal.

With this basic idea in mind, it's not too difficult to say "if I have two state machines representing two sets of strings, how to I derive a third representing the union" (or intersection, or set difference...). Instead of dotted rules, your state representations will a current state (or set of current states) from each input automaton and perhaps additional details.

If your regular grammars are getting complex, you can minimise. The basic idea here is relatively simple. You group all your states into one equivalence class or "block". Then you repeatedly test whether you need to split blocks (the states aren't really equivalent) with respect to a particular transition type. If all states in a particular block can accept a match of the same character and, in doing so, reach the same next-block, they are equivalent.

Hopcrofts algorithm is an efficient way to handle this basic idea.

A particularly interesting thing about minimisation is that every deterministic finite automaton has precisely one minimal form. Furthermore, Hopcrofts algorithm will produce the same representation of that minimal form, no matter what representation of what larger case it started from. That is, this is a "canonical" representation which can be used to derive a hash or for arbitrary-but-consistent orderings. What this means is that you can use minimal automata as keys into containers.

The above is probably a bit sloppy WRT definitions, so make sure you look up any terms yourself before using them yourself, but with a bit of luck this gives a fair quick introduction to the basic ideas.

BTW - have a look around the rest of Dick Grunes site - he has a free PDF book on parsing techniques. The first edition of Modern Compiler Design is pretty good IMO, but as you'll see, there's a second edition imminent.

Writing a parser for regular expressions

Solution 1:

Solution 2:

Related

Recent Posts