Short example of regular expression converted to a state machine?

Solution 1:

Sure, although you'll need more complicated examples to truly understand how REs work. Consider the following RE:

^[A-Za-z][A-Za-z0-9_]*$

which is a typical identifier (must start with alpha and can have any number of alphanumeric and undescore characters following, including none). The following pseudo-code shows how this can be done with a finite state machine:

state = FIRSTCHAR
for char in all_chars_in(string):
    if state == FIRSTCHAR:
            if char is not in the set "A-Z" or "a-z":
                error "Invalid first character"
            state = SUBSEQUENTCHARS
            next char
    if state == SUBSEQUENTCHARS:
            if char is not in the set "A-Z" or "a-z" or "0-9" or "_":
                error "Invalid subsequent character"
            state = SUBSEQUENTCHARS
            next char

Now, as I said, this is a very simple example. It doesn't show how to do greedy/nongreedy matches, backtracking, matching within a line (instead of the whole line) and other more esoteric features of state machines that are easily handled by the RE syntax.

That's why REs are so powerful. The actual finite state machine code required to do what a one-liner RE can do is usually very long and complex.

The best thing you could do is grab a copy of some lex/yacc (or equivalent) code for a specific simple language and see the code it generates. It's not pretty (it doesn't have to be since it's not supposed to be read by humans, they're supposed to be looking at the lex/yacc code) but it may give you a better idea as to how they work.

Solution 2:

A rather convenient way to help look at this to use python's little-known re.DEBUG flag on any pattern:

>>> re.compile(r'<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>', re.DEBUG)
literal 60
subpattern 1
  in
    range (65, 90)
  max_repeat 0 65535
    in
      range (65, 90)
      range (48, 57)
at at_boundary
max_repeat 0 65535
  not_literal 62
literal 62
subpattern 2
  min_repeat 0 65535
    any None
literal 60
literal 47
groupref 1
literal 62

The numbers after 'literal' and 'range' refer to the integer values of the ascii characters they're supposed to match.

Solution 3:

Make your own on the fly!

http://osteele.com/tools/reanimator/???

A finite state machine

This is a really nicely put together tool which visualises regular expressions as FSMs. It doesn't have support for some of the syntax you'll find in real-world regular expression engines, but certainly enough to understand exactly what's going on.