How do I parse an HTML page with Node.js?
Solution 1:
You can use the npm modules jsdom and htmlparser to create and parse a DOM in Node.js.
Other options include:
- BeautifulSoup for Python
- Converting your HTML to XHTML and using XSLT
- HTMLAgilityPack for .NET
- CsQuery for .NET (my new favorite)
- The SpiderMonkey and Rhino JS engines have native E4X support. This is only useful if you convert your HTML to XHTML first.
Of all these options, I prefer the Node.js one, because it uses the standard W3C DOM accessor methods and I can reuse code on both the client and the server. I wish BeautifulSoup's methods were more similar to the W3C DOM, and I think converting your HTML to XHTML just to write XSLT is plain sadistic.
Solution 2:
Use Cheerio. It isn't as strict as jsdom and is optimized for scraping. As a bonus, it uses the jQuery selectors you already know.
❤ Familiar syntax: Cheerio implements a subset of core jQuery. Cheerio removes all the DOM inconsistencies and browser cruft from the jQuery library, revealing its truly gorgeous API.
ϟ Blazingly fast: Cheerio works with a very simple, consistent DOM model. As a result parsing, manipulating, and rendering are incredibly efficient. Preliminary end-to-end benchmarks suggest that cheerio is about 8x faster than JSDOM.
❁ Insanely flexible: Cheerio wraps around @FB55's forgiving htmlparser. Cheerio can parse nearly any HTML or XML document.