What is the best way to parse HTML in C#? [closed]

I'm looking for a library or method for parsing an HTML file that offers more HTML-specific features than generic XML parsing libraries.

Solution 1:

Html Agility Pack

This is an agile HTML parser that builds a read/write DOM and supports plain XPath or XSLT (you don't actually have to understand XPath or XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to the one System.Xml offers, but for HTML documents (or streams).
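
As a rough sketch of what that looks like in practice, the snippet below loads a string of HTML and collects every link target with an XPath query. It assumes the HtmlAgilityPack NuGet package is referenced; the LinkExtractor class and GetLinks method are hypothetical names made up for this illustration.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // assumes the "HtmlAgilityPack" NuGet package

// Hypothetical helper for illustration; not part of the library.
public static class LinkExtractor
{
    // Returns the href value of every <a> element that has one.
    public static List<string> GetLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant of unclosed or badly nested tags

        var links = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
            return links;

        foreach (HtmlNode a in anchors)
            links.Add(a.GetAttributeValue("href", string.Empty));

        return links;
    }
}
```

Calling LinkExtractor.GetLinks(html) on a page returns the link targets in document order. The null check matters: forgetting it is a common source of NullReferenceException with this library.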

Solution 2:

You could use TidyNet.Tidy to convert the HTML to XHTML, and then use an XML parser.
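
The second half of that approach might look roughly like the sketch below, which uses System.Xml.Linq from the base class library. It assumes the HTML has already been converted to well-formed XHTML (the Tidy step is not shown) and that the input carries no DOCTYPE or HTML-only entities such as &nbsp;, which a plain XML parser may reject. XhtmlReader and GetLinks are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration.
public static class XhtmlReader
{
    // XHTML elements live in this namespace, so element names must be qualified with it.
    private static readonly XNamespace Xhtml = "http://www.w3.org/1999/xhtml";

    // Returns the href value of every <a> element that has one.
    public static List<string> GetLinks(string xhtml)
    {
        XDocument doc = XDocument.Parse(xhtml); // throws XmlException if not well-formed

        return doc.Descendants(Xhtml + "a")
                  .Attributes("href")           // skips anchors without an href
                  .Select(attr => attr.Value)
                  .ToList();
    }
}
```

The namespace is the usual stumbling block here: querying for plain "a" finds nothing once the document declares the XHTML namespace on its root element.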

Another alternative would be to use the built-in engine, mshtml:

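using mshtml; // COM reference: Microsoft HTML Object Library

// write() takes an object array; html is the markup string to parse
object[] oPageText = { html };
HTMLDocument doc = new HTMLDocumentClass();
IHTMLDocument2 doc2 = (IHTMLDocument2)doc;
doc2.write(oPageText); // feed the markup to the engine
doc2.close();          // finish the document so the DOM is complete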

This lets you use JavaScript-like functions such as getElementById() (that particular method is exposed on IHTMLDocument3, so cast doc to that interface to call it).

Solution 3:

I found a project called Fizzler that takes a jQuery/Sizzle approach to selecting HTML elements. It's based on Html Agility Pack. It's currently in beta and supports only a subset of CSS selectors, but it's pretty damn cool and refreshing to use CSS selectors instead of nasty XPath.
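
For a feel of the difference, here is a minimal sketch that selects nodes with a CSS selector instead of XPath. It assumes both the HtmlAgilityPack and Fizzler.Systems.HtmlAgilityPack NuGet packages are referenced; CssQuery and GetText are hypothetical names for this illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;                 // assumes the "HtmlAgilityPack" NuGet package
using Fizzler.Systems.HtmlAgilityPack; // assumes the "Fizzler.Systems.HtmlAgilityPack" NuGet package

// Hypothetical helper for illustration.
public static class CssQuery
{
    // Returns the trimmed inner text of every node matching the CSS selector.
    public static List<string> GetText(string html, string selector)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // QuerySelectorAll is an extension method that Fizzler adds to HtmlNode.
        return doc.DocumentNode
                  .QuerySelectorAll(selector)
                  .Select(node => node.InnerText.Trim())
                  .ToList();
    }
}
```

For example, CssQuery.GetText(html, "div.menu a") picks out the link text inside the element with class menu, where the XPath equivalent would need something like //div[contains(@class, 'menu')]//a.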


Solution 4:

You can do a lot without going nuts on third-party products and mshtml (i.e. interop). Use System.Windows.Forms.WebBrowser. From there, you can call methods such as GetElementById on an HtmlDocument or GetElementsByTagName on an HtmlElement. If you want to actually interface with the browser (simulate button clicks, for example), you can use a little reflection (IMO a lesser evil than interop) to do it:

var wb = new WebBrowser();

... tell the browser to navigate (tangential to this question). Then, in the DocumentCompleted event handler, you can simulate clicks like this:

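var doc = wb.Document; // HtmlDocument of the loaded page
var elem = doc.GetElementById(elementId);
object obj = elem.DomElement; // underlying COM object
// Late-bound call; GetMethod() typically cannot see members of a COM wrapper.
obj.GetType().InvokeMember("click", System.Reflection.BindingFlags.InvokeMethod, null, obj, null);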

You can use similar reflection to submit forms, etc.
