Using regular expressions to parse HTML: why not?

HTML cannot be fully parsed with regular expressions, because parsing it requires matching each opening tag with its closing tag, and regexps cannot do that in general.

Regular expressions can only match regular languages, but HTML is a context-free language that is not regular. (As @StefanPochmann pointed out, regular languages are also context-free, so context-free does not by itself mean not regular.) The only thing you can do with regexps on HTML is apply heuristics, and those will not work in every case. For any given regular expression, it should be possible to construct an HTML file that it matches wrongly.
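To make the "heuristics" point concrete, here is a minimal sketch using Python's standard `re` module. The pattern names and the sample strings are made up for illustration. It shows a non-greedy and a greedy pattern that each work on one input and each fail on another:

```python
import re

# Two hypothetical heuristics for "the content of a <div>".
LAZY_DIV = re.compile(r"<div>(.*?)</div>", re.DOTALL)    # stop at the first </div>
GREEDY_DIV = re.compile(r"<div>(.*)</div>", re.DOTALL)   # stop at the last </div>

nested = "<div>outer <div>inner</div> tail</div>"
siblings = "<div>a</div><div>b</div>"

# The lazy pattern handles siblings but cuts a nested div short.
lazy_on_siblings = LAZY_DIV.search(siblings).group(1)    # 'a'
lazy_on_nested = LAZY_DIV.search(nested).group(1)        # 'outer <div>inner'

# The greedy pattern handles this nesting but runs across siblings.
greedy_on_nested = GREEDY_DIV.search(nested).group(1)    # 'outer <div>inner</div> tail'
greedy_on_siblings = GREEDY_DIV.search(siblings).group(1)  # 'a</div><div>b'
```

Neither pattern is right for both inputs, and patching one case tends to break another.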


For quick'n'dirty jobs a regexp will do fine. But the fundamental thing to know is that it is impossible to construct a regexp that will correctly parse HTML.

The reason is that regexps can't handle arbitrarily nested expressions. See "Can regular expressions be used to match nested patterns?"
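As a rough illustration of what nesting requires, here is a sketch that tracks `<div>` depth with a counter, using Python's standard `html.parser` module. The class and function names are hypothetical, and `html.parser` is a lenient tokenizer rather than a full HTML5 parser. The point is only that the nesting state lives in the counter, which a true regular expression has no way to keep for unbounded depth:

```python
from html.parser import HTMLParser


class DivDepthTracker(HTMLParser):
    """Records how deeply <div> elements are nested in the input."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # divs currently open
        self.max_depth = 0   # deepest nesting seen so far

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth -= 1


def max_div_depth(html):
    """Return the maximum <div> nesting depth found in html."""
    tracker = DivDepthTracker()
    tracker.feed(html)
    tracker.close()
    return tracker.max_depth


nested_depth = max_div_depth("<div>outer <div>inner</div> tail</div>")  # 2
sibling_depth = max_div_depth("<div>a</div><div>b</div>")               # 1
```

The counter can grow as far as the input demands, so the same code handles ten levels of nesting as easily as two.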