How to use regular expressions to parse HTML in Java?

Solution 1:

Using regular expressions to pull values from HTML is always a mistake. HTML syntax is a lot more complex than it may first appear, and it's very easy for a page to catch out even a very complex regular expression.

Use an HTML parser instead. See also "What are the pros and cons of the leading Java HTML parsers?"
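To make the point concrete, here is a small sketch of how a plausible-looking pattern gets caught out by ordinary markup variations. The class name RegexFragility, the helper hrefs and the sample strings are made up for illustration; the pattern is just one naive attempt, not a recommended one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexFragility {
    // Naive pattern: assumes href is the only attribute and is single-quoted.
    static final Pattern NAIVE = Pattern.compile("<a href='(.*?)'>");

    // Returns group 1 of every match of the naive pattern.
    static List<String> hrefs(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = NAIVE.matcher(html);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String[] samples = {
            "<a href='link1'>plain</a>",                  // found: link1
            "<a href=\"link2\">double quotes</a>",        // missed
            "<a class='x' href='link3'>attr first</a>",   // missed
            "<A HREF='link4'>upper case</A>",             // missed
            "<a href='link5' class='x'>attr after</a>",   // wrong value: link5' class='x
            "<!-- <a href='link6'>commented out</a> -->"  // false positive: link6
        };
        for (String s : samples) {
            System.out.println(hrefs(s) + "  <-  " + s);
        }
    }
}
```

Each of these cases could be patched in the pattern, but every patch makes the expression longer and there is always another variation. A parser handles all of them by construction.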

Solution 2:

The other answers are right: the Java Regex API is not the proper tool for this job. Use the efficient, secure and well-tested high-level tools mentioned in the other answers.

If your question is about the Regex API itself rather than a real-life problem (for learning purposes, for example), you can do it with the following code:

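import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefRegexDemo {
    public static void main(String[] args) {
        String html = "foo <a href='link1'>bar</a> baz <a href='link2'>qux</a> foo";
        // Reluctant (.*?) expands only as far as the nearest '> after href='
        Pattern p = Pattern.compile("<a href='(.*?)'>");
        Matcher m = p.matcher(html);
        while (m.find()) {
            System.out.println(m.group(0)); // the entire matched tag
            System.out.println(m.group(1)); // the captured href value
        }
    }
}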

And the output is:

<a href='link1'>
link1
<a href='link2'>
link2

Please note that the lazy (reluctant) quantifier *? must be used so that each match is confined to a single tag. Group 0 is the entire match; group 1 is the first capturing group (the first pair of parentheses).
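To see why the reluctant form matters, this small sketch runs the same input through both variants of the pattern. The class name GreedyVsReluctant and the helper groupOne are made up for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctant {
    // Collects group 1 of every match of the given regex in the input.
    static List<String> groupOne(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "foo <a href='link1'>bar</a> baz <a href='link2'>qux</a> foo";

        // Reluctant: each match ends at the nearest '>
        System.out.println(groupOne("<a href='(.*?)'>", html));
        // prints [link1, link2]

        // Greedy: .* runs to the end of the input, then backtracks to the LAST '>
        System.out.println(groupOne("<a href='(.*)'>", html));
        // prints [link1'>bar</a> baz <a href='link2]
    }
}
```

With the greedy quantifier there is only one match, and it swallows everything between the first href=' and the last '> in the string.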

Solution 3:

Don't use regular expressions. Use NekoHTML or TagSoup, which act as a bridge that lets you visit an HTML document through SAX or DOM, the same way you would with XML.
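As a rough idea of what the SAX style of visiting looks like, here is a sketch of a handler that collects href values. To keep it runnable without extra jars it uses the JDK's own XML SAX parser as a stand-in, so the input here has to be well-formed markup; with NekoHTML or TagSoup the same kind of handler would instead be attached to the parser those libraries provide, which is what lets it cope with real-world HTML. The class name SaxHrefCollector and the sample markup are made up for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SaxHrefCollector extends DefaultHandler {
    final List<String> hrefs = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Called once per opening tag; quoting style and attribute order no longer matter.
        if ("a".equalsIgnoreCase(qName)) {
            String href = atts.getValue("href");
            if (href != null) {
                hrefs.add(href);
            }
        }
    }

    // Stand-in only: the JDK XML parser accepts well-formed markup, not arbitrary HTML.
    static List<String> collect(String wellFormedMarkup) {
        SaxHrefCollector handler = new SaxHrefCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(wellFormedMarkup)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse markup", e);
        }
        return handler.hrefs;
    }

    public static void main(String[] args) {
        String markup = "<p>foo <a class=\"x\" href=\"link1\">bar</a> baz <a href='link2'>qux</a></p>";
        System.out.println(collect(markup)); // prints [link1, link2]
    }
}
```

The handler never looks at raw text, so mixed quote styles or extra attributes on the tag make no difference to it.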

Solution 4:

If you want to go down the HTML parsing route, which Dave and I recommend, here's the code to parse a String (data) for anchor tags and print their href values.

Since you're just using anchor tags you should be OK with just regex, but if you want to do more, go with a parser. The Mozilla HTML Parser is the best out there.

// Needs java.io.File and org.w3c.dom.{Document, NamedNodeMap, Node, NodeList}, plus the Mozilla HTML Parser classes
File parserLibraryFile = new File("lib/MozillaHtmlParser/native/bin/MozillaParser" + EnviromentController.getSharedLibraryExtension());
String parserLibrary = parserLibraryFile.getAbsolutePath();
// mozilla.dist.bin directory:
final File mozillaDistBinDirectory = new File("lib/MozillaHtmlParser/mozilla.dist.bin." + EnviromentController.getOperatingSystemName());

// Initialise the native parser once, then parse the HTML string into a DOM document
MozillaParser.init(parserLibrary, mozillaDistBinDirectory.getAbsolutePath());
MozillaParser parser = new MozillaParser();
Document domDocument = parser.parse(data);
NodeList list = domDocument.getElementsByTagName("a");

// Print the href attribute of every anchor that has one
for (int i = 0; i < list.getLength(); i++) {
    Node n = list.item(i);
    NamedNodeMap m = n.getAttributes();
    if (m != null) {
        Node attrNode = m.getNamedItem("href");
        if (attrNode != null) {
            System.out.println(attrNode.getNodeValue());
        }
    }
}
