Parsing HTML on the iPhone [closed]
Can anyone recommend a C or Objective-C library for HTML parsing? It needs to handle messy HTML code that won't quite validate.
Does such a library exist, or am I better off just trying to use regular expressions?
I found hpple quite useful for parsing messy HTML. Hpple is an Objective-C wrapper around the XPathQuery library for parsing HTML. With it you can send an XPath query and receive the result.
Requirements:
- Add the libxml2 includes to your project
  - Menu Project->Edit Project Settings
  - Search for the setting "Header Search Paths"
  - Add a new search path "${SDKROOT}/usr/include/libxml2"
  - Enable the recursive option
- Add the libxml2 library to your project
  - Menu Project->Edit Project Settings
  - Search for the setting "Other Linker Flags"
  - Add a new linker flag "-lxml2"
- From hpple, get the following source files and add them to your project:
  - TFHpple.h
  - TFHpple.m
  - TFHppleElement.h
  - TFHppleElement.m
  - XPathQuery.h
  - XPathQuery.m
- Work through the W3Schools XPath tutorial to get comfortable with the XPath language.
Code Example
#import "TFHpple.h"
NSData *data = [[NSData alloc] initWithContentsOfFile:@"example.html"];
// Create parser
TFHpple *xpathParser = [[TFHpple alloc] initWithHTMLData:data];
// Get all the cells of the 2nd row of the 3rd table
NSArray *elements = [xpathParser searchWithXPathQuery:@"//table[3]/tr[2]/td"];
// Access the first cell
TFHppleElement *element = [elements objectAtIndex:0];
// Get the text within the cell tag
NSString *content = [element content];
[xpathParser release];
[data release];
Known issues
Because hpple is a wrapper over XPathQuery, which is itself a wrapper over libxml2, this option is probably not the most efficient. If performance matters in your project, I recommend writing your own lightweight solution based on the hpple and XPathQuery code.
Looks like libxml2.2 comes in the SDK, and libxml/HTMLparser.h claims the following:
This module implements an HTML 4.0 non-verifying parser with API compatible with the XML parser ones. It should be able to parse "real world" HTML, even if severely broken from a specification point of view.
That sounds like what I need, so I'm probably going to use that.
Just in case anyone got here by googling for a nice XPath parser and went off and used TFHpple: note that TFHpple uses XPathQuery. It is pretty good, but it has a memory leak.
In the function PerformXPathQuery, if the node set turns out to be nil, it returns before cleaning up.
So where you see this bit of code, add in the two cleanup lines:
xmlNodeSetPtr nodes = xpathObj->nodesetval;
if (!nodes)
{
    NSLog(@"Nodes was nil.");
    /* Cleanup */
    xmlXPathFreeObject(xpathObj);
    xmlXPathFreeContext(xpathCtx);
    return nil;
}
If you are doing a LOT of parsing, it's a vicious leak. Now.... how do I get my night back :-)
I wrote a lightweight wrapper around libxml that may be useful:
Objective-C-HMTL-Parser