Resources for lexing, tokenising and parsing in python
Solution 1:
I'm a happy user of PLY. It is a pure-Python implementation of Lex & Yacc, with lots of small niceties that make it quite Pythonic and easy to use. Since Lex & Yacc are the most popular lexing & parsing tools and are used for the most projects, PLY has the advantage of standing on giants' shoulders. A lot of knowledge exists online on Lex & Yacc, and you can freely apply it to PLY.
PLY also has a good documentation page with some simple examples to get you started.
For a listing of lots of Python parsing tools, see this.
Solution 2:
This question is pretty old, but maybe my answer would help someone who wants to learn the basics. I find this resource to be very good. It is a simple interpreter written in python without the use of any external libraries. So this will help anyone who would like to understand the internal working of parsing, lexing, and tokenising:
"A Simple Intepreter from Scratch in Python:" Part 1, Part 2, Part 3, and Part 4.
Solution 3:
For medium-complex grammars, PyParsing is brilliant. You can define grammars directly within Python code, no need for code generation:
>>> from pyparsing import Word, alphas
>>> greet = Word( alphas ) + "," + Word( alphas ) + "!" # <-- grammar defined here
>>> hello = "Hello, World!"
>>>> print hello, "->", greet.parseString( hello )
Hello, World! -> ['Hello', ',', 'World', '!']
(Example taken from the PyParsing home page).
With parse actions (functions that are invoked when a certain grammar rule is triggered), you can convert parses directly into abstract syntax trees, or any other representation.
There are many helper functions that encapsulate recurring patterns, like operator hierarchies, quoted strings, nesting or C-style comments.
Solution 4:
pygments is a source code syntax highlighter written in python. It has lexers and formatters, and may be interesting to peek at the source.
Solution 5:
Here's a few things to get you started (roughly from simplest-to-most-complex, least-to-most-powerful):
http://en.wikipedia.org/wiki/Recursive_descent_parser
http://en.wikipedia.org/wiki/Top-down_parsing
http://en.wikipedia.org/wiki/LL_parser
http://effbot.org/zone/simple-top-down-parsing.htm
http://en.wikipedia.org/wiki/Bottom-up_parsing
http://en.wikipedia.org/wiki/LR_parser
http://en.wikipedia.org/wiki/GLR_parser
When I learned this stuff, it was in a semester-long 400-level university course. We did a number of assignments where we did parsing by hand; if you want to really understand what's going on under the hood, I'd recommend the same approach.
This isn't the book I used, but it's pretty good: Principles of Compiler Design.
Hopefully that's enough to get you started :)