Can you provide examples of parsing HTML?
How do you parse HTML with a variety of languages and parsing libraries?
When answering:
Individual comments will be linked to in answers to questions about how to parse HTML with regexes as a way of showing the right way to do things.
For the sake of consistency, I ask that the example be parsing an HTML file for the href
in anchor tags. To make it easy to search this question, I ask that you follow this format
Language: [language name]
Library: [library name]
[example code]
Please make the library a link to the documentation for the library. If you want to provide an example other than extracting links, please also include:
Purpose: [what the parse does]
Language: JavaScript
Library: jQuery
$.each($('a[href]'), function(){
console.debug(this.href);
});
(using firebug console.debug for output...)
And loading any html page:
$.get('http://stackoverflow.com/', function(page){
$(page).find('a[href]').each(function(){
console.debug(this.href);
});
});
Used another each function for this one, I think it's cleaner when chaining methods.
Language: C#
Library: HtmlAgilityPack
class Program
{
static void Main(string[] args)
{
var web = new HtmlWeb();
var doc = web.Load("http://www.stackoverflow.com");
var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
foreach (var node in nodes)
{
Console.WriteLine(node.InnerHtml);
}
}
}