How to validate HTML from Java?

What is a fast and simple way to validate HTML from Java? I’m looking for an open-source/PD class (or set of classes) that describes the various properties of the 100-odd HTML tags, such as:

  1. Is the tag optional? Empty? Is it legal to omit its closing tag?
  2. Which other tags can this tag contain (if any)?
  3. Which attributes are legal for this tag, and what are their types? (not required, but nice to have)

Thanks!

EDIT

I'm looking to do a tag-by-tag analysis of an HTML document, so I'm less interested in whether the document as a whole is valid than in what the specific requirements are for each type of tag. I could encode the rules based on the W3C spec, but wanted to see which ready-made solutions are available first.


Solution 1:

If you want to verify that certain tags follow certain specifications, there seems to be no end of Java-based HTML parsers:

Open Source HTML Parsers in Java

In other words, you could parse your HTML, then inspect the resulting document for the tags you are interested in and determine whether they meet the specifications you require. If they don't, you could simply throw an error.
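
The "parse, then inspect" idea could look roughly like the sketch below. Everything in it is an assumption for illustration: `TagRule`, `Element` and the tiny `RULES` table are hypothetical names, not part of any parser on that list, and the tree is built by hand where a real program would walk the document returned by whichever parser you pick. The rules shown are only a handful taken from the HTML spec.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

class TagRuleCheck {

    /** Hypothetical per-tag rule: must the tag be empty, and which child tags may it contain? */
    static final class TagRule {
        final boolean empty;
        final Set<String> allowedChildren;

        TagRule(boolean empty, Set<String> allowedChildren) {
            this.empty = empty;
            this.allowedChildren = allowedChildren;
        }
    }

    /** Hypothetical stand-in for an element node produced by a real HTML parser. */
    static final class Element {
        final String tag;
        final List<Element> children;

        Element(String tag, List<Element> children) {
            this.tag = tag;
            this.children = children;
        }

        static Element of(String tag, Element... children) {
            return new Element(tag, List.of(children));
        }
    }

    // Tiny illustrative table; a complete one would cover every tag in the W3C spec.
    static final Map<String, TagRule> RULES = Map.of(
        "ul",  new TagRule(false, Set.of("li")),
        "li",  new TagRule(false, Set.of("ul", "p", "br", "img")),
        "p",   new TagRule(false, Set.of("br", "img")),
        "br",  new TagRule(true,  Set.of()),
        "img", new TagRule(true,  Set.of())
    );

    /** Walks the tree depth-first and returns one message per rule violation. */
    static List<String> check(Element root) {
        List<String> problems = new ArrayList<>();
        collect(root, problems);
        return problems;
    }

    private static void collect(Element e, List<String> problems) {
        TagRule rule = RULES.get(e.tag);
        if (rule == null) {
            problems.add("unknown tag <" + e.tag + ">");
        } else if (rule.empty && !e.children.isEmpty()) {
            problems.add("<" + e.tag + "> must be empty");
        } else {
            for (Element child : e.children) {
                if (!rule.allowedChildren.contains(child.tag)) {
                    problems.add("<" + child.tag + "> is not allowed inside <" + e.tag + ">");
                }
            }
        }
        // Keep descending so that nested violations are reported as well.
        for (Element child : e.children) {
            collect(child, problems);
        }
    }

    public static void main(String[] args) {
        Element doc = Element.of("ul",
            Element.of("li", Element.of("br")),
            Element.of("p"));
        System.out.println(check(doc));
    }
}
```

Collecting messages in a list rather than throwing on the first violation is just one design choice; if you would rather fail fast, throw from `collect` instead.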

I don't think you'll find an HTML analysis tool that was written with exactly your requirements in mind, mostly because those requirements haven't been spelled out and are probably a bit nebulous.

If a parser doesn't do what you want out of the box, the ones on this list are at least open source, so you can modify the parser yourself, subject to the terms of its license (some licenses require you to publish your changes).

Solution 2:

Check JTidy (http://jtidy.sourceforge.net/) and VietSpider HTMLParser (http://sourceforge.net/projects/binhgiang/). Both are Java HTML parsers with some syntax-checking capabilities. Some Eclipse-based HTML editor plugins use JTidy (or a port of Tidy) for syntax checking. Or, as David said, submit the page to w3c.org for validation.