lexers vs parsers
Are lexers and parsers really that different in theory?
It seems fashionable to hate regular expressions: coding horror, another blog post.
However, popular lexing-based tools (pygments, geshi, prettify) all use regular expressions. They seem to be able to lex anything...
When is lexing enough, when do you need EBNF?
Has anyone used the tokens produced by these lexers with bison or antlr parser generators?
What parsers and lexers have in common:
- They read symbols of some alphabet from their input.
  - Hint: the alphabet doesn't necessarily have to consist of letters, but it has to consist of symbols which are atomic for the language understood by the parser/lexer.
  - Symbols for the lexer: ASCII characters.
  - Symbols for the parser: the particular tokens, which are terminal symbols of its grammar.
- They analyse these symbols and try to match them with the grammar of the language they understand.
  - Here's where the real difference usually lies. See below for more.
  - Grammar understood by lexers: regular grammar (Chomsky level 3).
  - Grammar understood by parsers: context-free grammar (Chomsky level 2).
- They attach semantics (meaning) to the language pieces they find.
  - Lexers attach meaning by classifying lexemes (strings of symbols from the input) as particular tokens. E.g. all of these lexemes: `*`, `==`, `<=`, `^` will be classified as the "operator" token by a C/C++ lexer.
  - Parsers attach meaning by classifying strings of tokens from the input (sentences) as particular nonterminals and by building the parse tree. E.g. all of these token strings: `[number][operator][number]`, `[id][operator][id]`, `[id][operator][number][operator][number]` will be classified as the "expression" nonterminal by a C/C++ parser. (A minimal lexer sketch in this spirit follows right after this list.)
- They can attach some additional meaning (data) to the recognized elements.
  - When a lexer recognizes a character sequence constituting a proper number, it can convert it to its binary value and store it with the "number" token.
  - Similarly, when a parser recognizes an expression, it can compute its value and store it with the "expression" node of the syntax tree.
- They all produce on their output the proper sentences of the language they recognize.
  - Lexers produce tokens, which are sentences of the regular language they recognize. Each token can have an inner syntax (though level 3, not level 2), but that doesn't matter for the output data or for the one which reads them.
  - Parsers produce syntax trees, which are representations of sentences of the context-free language they recognize. Usually it's only one big tree for the whole document/source file, because the whole document/source file is a proper sentence for the parser. But there isn't any reason why a parser couldn't produce a series of syntax trees on its output. E.g. it could be a parser which recognizes SGML tags stuck into plain text, so that it tokenizes the SGML document into a series of tokens: `[TXT][TAG][TAG][TXT][TAG][TXT]...`
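As a rough illustration of the lexer side of this list, here is a minimal sketch in Python (the token names and regular-expression patterns are assumptions made up for this example, not taken from any real lexer) which classifies lexemes as tokens and attaches extra data, such as a number's value:

# A minimal lexer sketch: classify lexemes as tokens and attach data to them.
import re

TOKEN_SPEC = [                      # assumed token classes and patterns
    ("number",   r"\d+"),
    ("id",       r"[A-Za-z_]\w*"),
    ("operator", r"==|<=|\*|\^|\+|-"),
    ("skip",     r"\s+"),
]
MASTER = re.compile("|".join("(?P<%s>%s)" % (name, pat) for name, pat in TOKEN_SPEC))

def lex(text):
    for match in MASTER.finditer(text):
        kind, lexeme = match.lastgroup, match.group()
        if kind == "skip":
            continue                              # drop whitespace
        value = int(lexeme) if kind == "number" else lexeme
        yield (kind, value)                       # a token: its class plus attached data

print(list(lex("x <= 3 * y")))
# [('id', 'x'), ('operator', '<='), ('number', 3), ('operator', '*'), ('id', 'y')]

The token stream printed at the end is exactly the kind of `[id][operator][number]`... sentence that a parser would then classify as an "expression".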
As you can see, parsers and tokenizers have much in common. One parser can be a tokenizer for another parser, which reads its input tokens as symbols of its own alphabet (tokens are simply symbols of some alphabet), in the same way as sentences of one language can be alphabetic symbols of some other, higher-level language. For example, if `*` and `-` are the symbols of the alphabet `M` (as "Morse code symbols"), then you can build a parser which recognizes strings of these dots and dashes as letters encoded in Morse code. The sentences in the language "Morse code" could be tokens for some other parser, for which these tokens are atomic symbols of its language (e.g. the "English words" language). And these "English words" could be tokens (symbols of the alphabet) for some higher-level parser which understands the "English sentences" language. All these languages differ only in the complexity of the grammar. Nothing more.
So what is it all about, these "Chomsky grammar levels"? Well, Noam Chomsky classified grammars into four levels depending on their complexity:
- Level 3: Regular grammars
  They use regular expressions, that is, they can consist only of symbols of the alphabet (`a`, `b`), their concatenations (`ab`, `aba`, `bbb` etc.), or alternatives (e.g. `a|b`).
  They can be implemented as finite state automata (FSA), like an NFA (Nondeterministic Finite Automaton) or, better, a DFA (Deterministic Finite Automaton); see the small DFA sketch after this list.
  Regular grammars can't handle nested syntax, e.g. properly nested/matched parentheses `(()()(()()))`, nested HTML/BBcode tags, nested blocks etc. That's because a state automaton dealing with it would have to have infinitely many states to handle infinitely many nesting levels.
- Level 2: Context-free grammars
  They can have nested, recursive, self-similar branches in their syntax trees, so they can handle nested structures well.
  They can be implemented as a state automaton with a stack. This stack is used to represent the nesting level of the syntax. In practice, they're usually implemented as a top-down, recursive-descent parser which uses the machine's procedure call stack to track the nesting level, and uses recursively called procedures/functions for every non-terminal symbol in the syntax.
  But they can't handle context-sensitive syntax, e.g. when you have an expression `x+3` and in one context this `x` could be the name of a variable, while in another context it could be the name of a function, etc.
- Level 1: Context-sensitive grammars
- Level 0: Unrestricted grammars
  Also called recursively enumerable grammars.
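As a small illustration of the Level 3 case, here is a minimal DFA sketch in Python (the states and the recognized token shape are assumptions for the example): it recognizes identifiers of the regular form letter (letter | digit)* with only two states and no stack, which is precisely why such a machine cannot count nesting levels.

# A minimal DFA sketch (hypothetical): finitely many states, no stack.
def is_identifier(text):
    state = "START"
    for ch in text:
        if state == "START":
            state = "IDENT" if ch.isalpha() else "ERROR"
        elif state == "IDENT":
            state = "IDENT" if (ch.isalpha() or ch.isdigit()) else "ERROR"
        if state == "ERROR":
            return False
    return state == "IDENT"          # the only accepting state

print(is_identifier("foo42"))        # True
print(is_identifier("42foo"))        # False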
Yes, they are very different in theory, and in implementation.
Lexers are used to recognize "words" that make up language elements, because the structure of such words is generally simple. Regular expressions are extremely good at handling this simpler structure, and there are very high-performance regular-expression matching engines used to implement lexers.
Parsers are used to recognize the "structure" of language phrases. Such structure is generally far beyond what "regular expressions" can recognize, so one needs "context-sensitive" parsers to extract such structure. Context-sensitive parsers are hard to build, so the engineering compromise is to use "context-free" grammars and add hacks to the parsers ("symbol tables", etc.) to handle the context-sensitive part.
Neither lexing nor parsing technology is likely to go away soon.
They may be unified by deciding to use "parsing" technology to recognize "words", as is currently explored by so-called scannerless GLR parsers. That has a runtime cost, as you are applying more general machinery to what is often a problem that doesn't need it, and usually you pay for that in overhead. Where you have lots of free cycles, that overhead may not matter. If you process a lot of text, then the overhead does matter and classical regular expression parsers will continue to be used.
When is lexing enough, when do you need EBNF?
EBNF really doesn't add much to the power of grammars. It's just a convenience / shortcut notation / "syntactic sugar" over plain BNF grammar rules. For example, the EBNF alternative:
S --> A | B
you can achieve in plain BNF by just listing each alternative production separately:
S --> A // `S` can be `A`,
S --> B // or it can be `B`.
The optional element from EBNF:
S --> X?
you can achieve in plain BNF by using a nullable production, that is, one which can be replaced by an empty string (denoted here by just an empty production; others use epsilon or lambda or a crossed circle):
S --> B // `S` can be `B`,
B --> X // and `B` can be just `X`,
B --> // or it can be empty.
A production in a form like the last `B` above is called an "erasure", because it can erase whatever it stands for in other productions (produce an empty string instead of something else).
Zero-or-more repetition from EBNF:
S --> A*
you can obtain by using a recursive production, that is, one which embeds itself somewhere in itself. It can be done in two ways. The first is left recursion (which usually should be avoided, because top-down recursive-descent parsers cannot parse it):
S --> S A // `S` is just itself ended with `A` (which can be done many times),
S --> // or it can begin with empty-string, which stops the recursion.
Knowing that it generates just an empty string (ultimately) followed by zero or more `A`s, the same language (though with different derivations and parse trees) can be expressed using right recursion:
S --> A S // `S` can be `A` followed by itself (which can be done many times),
S --> // or it can be just empty-string end, which stops the recursion.
And when it comes to `+` for one-or-more repetition from EBNF:
S --> A+
it can be done by factoring out one `A` and using `*` as before:
S --> A A*
which you can express in plain BNF like this (I use right recursion here; try to figure out the other one yourself as an exercise):
S --> A S // `S` can be one `A` followed by `S` (which stands for more `A`s),
S --> A // or it could be just one single `A`.
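As a rough illustration of how such productions map onto a hand-written recursive-descent parser, here is a minimal sketch in Python (the token kind and function names are assumptions for this example) of the rule S --> A+ written once as a loop (the EBNF view) and once as right recursion (the BNF view):

# Recognizing  S --> A+  over a list of tokens; 'A' is an assumed token kind.

def parse_s_loop(tokens, pos=0):
    # EBNF view: one A, then a loop for the trailing A*.
    if pos >= len(tokens) or tokens[pos] != "A":
        raise SyntaxError("expected at least one A")
    pos += 1
    while pos < len(tokens) and tokens[pos] == "A":
        pos += 1
    return pos                                     # position just after the last A

def parse_s_recursive(tokens, pos=0):
    # BNF view:  S --> A S | A  as right recursion.
    if pos >= len(tokens) or tokens[pos] != "A":
        raise SyntaxError("expected at least one A")
    if pos + 1 < len(tokens) and tokens[pos + 1] == "A":
        return parse_s_recursive(tokens, pos + 1)  # S --> A S
    return pos + 1                                 # S --> A

print(parse_s_loop(["A", "A", "A"]))       # 3
print(parse_s_recursive(["A", "A", "A"]))  # 3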
Knowing that, you can now probably recognize a grammar for a regular expression (that is, a regular grammar) as one which can be expressed in a single EBNF production consisting only of terminal symbols. More generally, you can recognize regular grammars when you see productions similar to these:
A --> // Empty (nullable) production (AKA erasure).
B --> x // Single terminal symbol.
C --> y D // Simple state change from `C` to `D` when seeing input `y`.
E --> F z // Simple state change from `E` to `F` when seeing input `z`.
G --> G u // Left recursion.
H --> v H // Right recursion.
That is, using only empty strings, terminal symbols, and simple non-terminals for substitutions and state changes, and using recursion only to achieve repetition (iteration, which is just linear recursion, the kind which doesn't branch tree-like). Nothing more advanced than these, and you can be sure it's a regular syntax and you can go with just a lexer for it.
But when your syntax uses recursion in a non-trivial way, to produce tree-like, self-similar, nested structures, like the following one:
S --> a S b // `S` can be itself "parenthesized" by `a` and `b` on both sides.
S --> // or it could be (ultimately) empty, which ends recursion.
then you can easily see that this cannot be done with a regular expression, because you cannot resolve it into one single EBNF production in any way; you'd end up substituting for `S` indefinitely, which would always add another `a` and `b` on both sides. Lexers (more specifically: the finite state automata used by lexers) cannot count to an arbitrary number (they are finite, remember?), so they don't know how many `a`s there were to match them evenly with as many `b`s. Grammars like this are called context-free grammars (at the very least), and they require a parser.
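To make that concrete, here is a minimal recursive-descent sketch in Python (the function names are just assumptions for the example) for the grammar S --> a S b | (empty); the depth of the call stack does the counting that a finite state automaton, and hence a plain regular expression, cannot:

# Recursive-descent recognizer for  S --> a S b | (empty).
def parse_s(text, pos=0):
    # Return the position just after one S starting at `pos`.
    if pos < len(text) and text[pos] == "a":
        pos = parse_s(text, pos + 1)              # the nested S
        if pos < len(text) and text[pos] == "b":
            return pos + 1                        # the matching `b`
        raise SyntaxError("expected 'b' at position %d" % pos)
    return pos                                    # the empty alternative

def matches(text):
    try:
        return parse_s(text) == len(text)
    except SyntaxError:
        return False

print(matches("aaabbb"))   # True:  as many `a`s as `b`s, properly nested
print(matches("aabbb"))    # False: the counts don't match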
Context-free grammars are well understood and there are standard techniques for parsing them, so they are widely used for describing programming languages' syntax. But there's more. Sometimes a more general grammar is needed, when you have more things to count at the same time, independently. For example, when you want to describe a language where one can use round parentheses and square brackets interleaved, but they have to be paired up correctly with each other (brackets with brackets, parentheses with parentheses). This kind of grammar is called context-sensitive. You can recognize it by the fact that it has more than one symbol on the left side (before the arrow). For example:
A R B --> A S B
You can think of these additional symbols on the left as a "context" for applying the rule. There could be some preconditions, postconditions etc. For example, the above rule will replace `R` with `S`, but only when it's in between `A` and `B`, leaving those `A` and `B` themselves unchanged. This kind of syntax is really hard to parse, because it needs a much more powerful machine (a linear-bounded automaton; fully unrestricted grammars need a full-blown Turing machine). It's a whole other story, so I'll end here.
To answer the question as asked (without repeating unduly what appears in other answers)
Lexers and parsers are not very different, contrary to what the accepted answer suggests. Both are based on simple language formalisms: regular languages for lexers and, almost always, context-free (CF) languages for parsers. Both are associated with fairly simple computational models, the finite state automaton and the push-down (stack) automaton. Regular languages are a special case of context-free languages, so lexers could be produced with the somewhat more complex CF technology. But it is not a good idea, for at least two reasons.
A fundamental point in programming is that a system component should be built with the most appropriate technology, so that it is easy to produce, to understand and to maintain. The technology should not be overkill (using techniques much more complex and costly than needed), nor should it be at the limit of its power, thus requiring technical contortions to achieve the desired goal.
That is why "It seems fashionable to hate regular expressions". Though they can do a lot, they sometimes require very unreadable coding to achieve it, not to mention the fact that various extensions and restrictions in implementations somewhat reduce their theoretical simplicity. Lexers do not usually do that, and are usually a simple, efficient, and appropriate technology for parsing tokens. Using CF parsers for tokens would be overkill, though it is possible.
Another reason not to use the CF formalism for lexers is that it might then be tempting to use the full CF power. But that might raise structural problems regarding the reading of programs.
Fundamentally, most of the structure of program text, from which meaning is extracted, is a tree structure. It expresses how the parsed sentence (program) is generated from syntax rules. Semantics is derived by compositional techniques (homomorphisms, for the mathematically oriented) from the way syntax rules are composed to build the parse tree. Hence the tree structure is essential. The fact that tokens are identified with a lexer based on regular sets does not change the situation, because CF composed with regular still gives CF (I am speaking very loosely here about regular transducers, which transform a stream of characters into a stream of tokens).
However, CF composed with CF (via CF transducers ... sorry for the math) does not necessarily give CF, and might make things more general, but less tractable in practice. So CF is not the appropriate tool for lexers, even though it can be used.
One of the major differences between regular and CF is that regular languages (and transducers) compose very well with almost any formalism in various ways, while CF languages (and transducers) do not, not even with themselves (with a few exceptions).
(Note that regular transducers may have other uses, such as the formalization of some syntax-error-handling techniques.)
BNF is just a specific syntax for presenting CF grammars.
EBNF is syntactic sugar for BNF, using the facilities of regular notation to give a terser version of BNF grammars. It can always be transformed into an equivalent pure BNF.
However, the regular notation is often used in EBNF only to emphasize those parts of the syntax that correspond to the structure of lexical elements and should be recognized with the lexer, while the rest is presented in straight BNF. But it is not an absolute rule.
To summarize, the simpler structure of tokens is better analyzed with the simpler technology of regular languages, while the tree-oriented structure of the language (of program syntax) is better handled by CF grammars.
I would suggest also looking at AHR's answer.
But this leaves a question open: Why trees?
Trees are a good basis for specifying syntax because:
- they give a simple structure to the text;
- they are very convenient for associating semantics with the text on the basis of that structure, with a mathematically well understood technology (compositionality via homomorphisms), as indicated above. It is a fundamental algebraic tool for defining the semantics of mathematical formalisms.
Hence trees are a good intermediate representation, as shown by the success of Abstract Syntax Trees (ASTs). Note that ASTs are often different from parse trees because the parsing technology used by many professionals (such as LL or LR) applies only to a subset of CF grammars, thus forcing grammatical distortions which are later corrected in the AST. This can be avoided with more general parsing technology (based on dynamic programming) that accepts any CF grammar.
Statements that programming languages are context-sensitive (CS) rather than CF are arbitrary and disputable.
The problem is that the separation of syntax and semantics is arbitrary. Checking declarations or type agreement may be seen as either part of syntax, or part of semantics. The same would be true of gender and number agreement in natural languages. But there are natural languages where plural agreement depends on the actual semantic meaning of words, so that it does not fit well with syntax.
Many definitions of programming languages in denotational semantics place declarations and type checking in the semantics. So stating, as Ira Baxter does, that CF parsers are being hacked to get the context sensitivity required by syntax is at best an arbitrary view of the situation. It may be organized as a hack in some compilers, but it does not have to be.
Also, it is not just that CS parsers (in the sense used in other answers here) are hard to build and less efficient. They are also inadequate for expressing perspicuously the kind of context-sensitivity that might be needed. And they do not naturally produce a syntactic structure (such as parse trees) that is convenient for deriving the semantics of the program, i.e. for generating the compiled code.
There are a number of reasons why the analysis portion of a compiler is normally separated into lexical analysis and parsing (syntax analysis) phases.
- Simplicity of design is the most important consideration. The separation of lexical and syntactic analysis often allows us to simplify at least one of these tasks. For example, a parser that had to deal with comments and white space as syntactic units would be considerably more complex than one that can assume comments and white space have already been removed by the lexical analyzer. If we are designing a new language, separating lexical and syntactic concerns can lead to a cleaner overall language design.
- Compiler efficiency is improved. A separate lexical analyzer allows us to apply specialized techniques that serve only the lexical task, not the job of parsing. In addition, specialized buffering techniques for reading input characters can speed up the compiler significantly.
- Compiler portability is enhanced. Input-device-specific peculiarities can be restricted to the lexical analyzer.
Source: Compilers: Principles, Techniques, and Tools (2nd Edition), by Alfred V. Aho (Columbia University), Monica S. Lam (Stanford University), Ravi Sethi (Avaya), and Jeffrey D. Ullman (Stanford University).