Is there an easier way to parse XML in Java?

I'm trying to figure out how to parse some XML (for an Android app), and it seems pretty ridiculous how difficult it is to do in Java. It seems like it requires creating an XML handler which has various callbacks (startElement, endElement, and so on), and you have to then take care of changing all this data into objects. Something like this tutorial.

All I really need is to change an XML document into a multidimensional array, and even better would be to have some sort of Hpricot processor. Is there any way to do this, or do I really have to write all the extra code in the example above?


There are two different types of processors for XML in Java (3 actually, but one is weird). What you have is a SAX parser and what you want is a DOM parser. Take a look at http://www.mkyong.com/java/how-to-read-xml-file-in-java-dom-parser/ for how to use the DOM parser. DOM builds the whole document into an in-memory tree which you can navigate pretty easily. SAX is best for large documents; DOM is much easier to use, but it is slower and much more memory-intensive.
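As a rough illustration of the DOM approach, here is a minimal sketch using the standard javax.xml.parsers API. The class name DomExample, the method readTitles, and the library/book/title document shape are made up for this example; adapt them to your own XML.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical example class; the <library>/<book>/<title> layout is an assumption.
class DomExample {

    /** Parses the whole stream into a DOM tree and collects the text of every <title> element. */
    static List<String> readTitles(InputStream xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(xml);

        // getElementsByTagName returns matching descendants in document order.
        NodeList nodes = doc.getElementsByTagName("title");
        List<String> titles = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element title = (Element) nodes.item(i);
            titles.add(title.getTextContent().trim());
        }
        return titles;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<library>"
                + "<book><title>First</title></book>"
                + "<book><title>Second</title></book>"
                + "</library>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(readTitles(in)); // prints [First, Second]
    }
}
```

There are no callbacks here: once the tree is built you just walk it and copy what you need into lists or arrays.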


Try http://simple.sourceforge.net. It's an XML-to-Java serialization and binding framework. It's fully compatible with Android and very lightweight: 270K and no dependencies.


Kyle,

(Please excuse the self-promotey nature of this post... I've been working on this library for months and it's all open source/Apache 2, so not that self-serving, just trying to help).

I just released a library I'm calling SJXP or "Simple Java XML Parser" http://www.thebuzzmedia.com/software/simple-java-xml-parser-sjxp/

It is a very small/tight (4 classes) abstraction layer that sits on top of any spec-compliant XML Pull Parser.

On Android and non-Android Java platforms, pull parsing is probably one of the most performant (both in speed and low memory overhead) methods of parsing. Unfortunately coding directly against a pull-parser ends up looking a lot like any other XML parsing code (e.g. SAX) -- you have exception handlers, maintaining parser state, error checking, event handling, value parsing, etc.

SJXP lets you define XPath-like "paths" to the elements or attributes in a document whose values you want, like:

/rss/channel/title

and it will invoke your callback, with the value, when that rule matches. The API is really straightforward and has intuitive support for namespace-qualified elements if that is what you are trying to parse.

The code for a standard parser would look something like this (an example that parses an RSS2 feed title):

IRule titleRule = new DefaultRule(Type.CHARACTER, "/rss/channel/title") {
    @Override
    public void handleParsedCharacters(XMLParser parser, String text) {
        // Called with the text of each matching element.
        // Store the title in a DB or something fancy
    }
};

then you just create an XMLParser instance and give it all the rules you want it to care about:

XMLParser parser = new XMLParser(titleRule);
parser.parse(xmlStream);

And that's it, the parser will invoke the handler method every time the rule matches. You can stop parsing at any time by calling parser.stop() if you want.

Additionally (and this is the real win of this library), matching namespace-qualified elements and attributes is dead easy: you just add their namespace URI inside brackets, prefixing the name of the element in your path.

For example, say you want the value of the dc:language element of an RSS feed so you can tell what language it is in (ref: http://web.resource.org/rss/1.0/modules/dc/). You just use the unique namespace URI for that 'language' element with the 'dc' prefix, and the rule path ends up looking like this:

/rss/channel/[http://purl.org/dc/elements/1.1/]language

The same goes for namespace-qualified attributes as well.

With all that ease, the only overhead you add to the parsing process is an O(1) hash lookup at each location of the XML document and a few-hundred bytes, maybe 1k, for the internal location state of the parser.
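To make the "lookup at each location" idea concrete, here is a rough sketch of the general technique. This is NOT SJXP's code or API: it is a hypothetical illustration built on the JDK's SAX parser rather than a pull parser, and the names PathRuleSketch and addRule are invented. It tracks the current path while parsing and does one map lookup per element.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; not part of SJXP. Handles simple text-only elements.
class PathRuleSketch extends DefaultHandler {
    private final Map<String, Consumer<String>> rules = new HashMap<>();
    private final StringBuilder path = new StringBuilder(); // current location, e.g. /rss/channel
    private final StringBuilder text = new StringBuilder(); // text of the element being read

    void addRule(String rulePath, Consumer<String> callback) {
        rules.put(rulePath, callback);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        path.append('/').append(qName);
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // A single map lookup decides whether anyone cares about this location.
        Consumer<String> callback = rules.get(path.toString());
        if (callback != null) {
            callback.accept(text.toString().trim());
        }
        path.setLength(path.length() - qName.length() - 1); // pop "/qName"
        text.setLength(0);
    }

    public static void main(String[] args) throws Exception {
        List<String> titles = new ArrayList<>();
        PathRuleSketch handler = new PathRuleSketch();
        handler.addRule("/rss/channel/title", titles::add);

        String xml = "<rss><channel><title>Example Feed</title>"
                + "<item><title>Post</title></item></channel></rss>";
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        System.out.println(titles); // prints [Example Feed]
    }
}
```

Note that the item title does not match, because its path is /rss/channel/item/title. The only extra state is the path buffer and the rule map.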

The library works on Android with no additional dependencies (because the platform provides an org.xmlpull impl already) and in any other Java runtime by adding the XPP3 dependency.

This library is the result of many months of writing custom pull parsers for every kind of feed XML out there in every language and realizing (over time) that about 90% of parsing can be distilled down into this really basic paradigm.

I hope you find it handy.


Starting with Java 5, there is an XPath library in the SDK (the javax.xml.xpath package). See this tutorial for an introduction to it.
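For a quick feel of what that looks like, here is a minimal sketch using javax.xml.xpath. The class name XPathExample, the method channelTitle, and the sample RSS snippet are assumptions made up for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical example class; the RSS-like input is an assumption.
class XPathExample {

    /** Returns the string value of the first node matching /rss/channel/title. */
    static String channelTitle(String xml) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // evaluate(String, InputSource) parses the source and evaluates the expression as a string.
        return xpath.evaluate("/rss/channel/title", new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<rss><channel><title>Example Feed</title></channel></rss>";
        System.out.println(channelTitle(xml)); // prints Example Feed
    }
}
```

This still parses the whole document under the hood, so it has DOM-like memory costs, but it saves you from writing any handler or tree-walking code for simple lookups.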