XML Processing in Python

I am about to build a piece of a project that will need to construct an XML document and post it to a web service. I'd like to do it in Python, as a way to expand my skills in the language.

Unfortunately, whilst I know the XML model fairly well in .NET, I'm uncertain what the pros and cons are of the XML models in Python.

Anyone have experience doing XML processing in Python? Where would you suggest I start? The XML files I'll be building will be fairly simple.


Solution 1:

Personally, I've played with several of the built-in options on an XML-heavy project and have settled on pulldom as the best choice for less complex documents.

Especially for small simple stuff, I like the event-driven theory of parsing rather than setting up a whole slew of callbacks for a relatively simple structure. Here is a good quick discussion of how to use the API.

What I like: you can handle the parsing in a for loop rather than using callbacks. You also delay full parsing (the "pull" part) and only get additional detail when you call expandNode(). This satisfies my general requirement for "responsible" efficiency without sacrificing ease of use and simplicity.
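
To make the for-loop style concrete, here is a minimal sketch using the standard library's xml.dom.pulldom. The sample document and the function name items_by_order are made up for illustration; they are not from any real project.

```python
from xml.dom import pulldom

# Hypothetical sample document, invented for this sketch.
SAMPLE = """<orders>
  <order id="1"><item>widget</item></order>
  <order id="2"><item>gadget</item></order>
</orders>"""


def items_by_order(xml_text):
    """Map each order id to its item text, using a plain for loop over events."""
    result = {}
    events = pulldom.parseString(xml_text)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "order":
            # The "pull" part: only now is the subtree under <order>
            # read and attached to the node as a small DOM fragment.
            events.expandNode(node)
            item = node.getElementsByTagName("item")[0]
            result[node.getAttribute("id")] = item.firstChild.data
    return result


print(items_by_order(SAMPLE))
```

Anything you never call expandNode() on is skipped as a stream of events, without its subtree being built.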

Solution 2:

ElementTree has a nice Pythonic API, and it has shipped in the standard library (as xml.etree.ElementTree) since Python 2.5.

It's in pure Python and, as I say, pretty nice, but if you wind up needing more performance, lxml exposes largely the same API and uses libxml2 under the hood. You can theoretically just swap it in when you discover you need it.
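
Since the question is about building a simple document and posting it, here is a rough sketch with ElementTree and urllib.request. The element names, the function names and the example.invalid URL are all assumptions for illustration; the request is only prepared, not sent.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_order(order_id, items):
    """Build a small <order> document and return it as UTF-8 bytes."""
    root = ET.Element("order", attrib={"id": str(order_id)})
    for name in items:
        # SubElement creates the child and appends it to root in one step.
        ET.SubElement(root, "item").text = name
    return ET.tostring(root, encoding="utf-8")


def prepare_post(url, payload):
    """Wrap the XML bytes in a POST request object without sending it."""
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/xml"},
        method="POST",
    )


payload = build_order(42, ["widget", "gadget"])
# Hypothetical endpoint; urllib.request.urlopen(request) would actually send it.
request = prepare_post("http://example.invalid/orders", payload)
```

If you later switch to lxml, the Element and SubElement calls should carry over mostly unchanged, though I would test the serialization options rather than assume they are identical.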

Solution 3:

There are three major ways of dealing with XML in general: DOM, SAX, and XPath. The DOM model is good if you can afford to load your entire XML file into memory at once, you don't mind dealing with data structures, and you are looking at much or most of the model. The SAX model is great if you only care about a few tags, and/or you are dealing with big files and can process them sequentially. The XPath model is a little bit of each: you can pick and choose paths to the data elements you need, but full XPath support requires additional libraries.
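
As a small illustration of the path-based style, ElementTree in the standard library understands a limited subset of XPath, which is often enough for simple documents (full XPath needs something like lxml). The catalog document and function names below are invented for this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, invented for this sketch.
CATALOG = """<catalog>
  <book id="b1"><title>First</title><price>10</price></book>
  <book id="b2"><title>Second</title><price>25</price></book>
</catalog>"""


def titles(xml_text):
    """Return every book title by following a path from the root."""
    root = ET.fromstring(xml_text)
    return [node.text for node in root.findall("./book/title")]


def title_for(xml_text, book_id):
    """Return the title of the book with the given id, or None if absent."""
    root = ET.fromstring(xml_text)
    # Attribute predicates like [@id='...'] are part of the supported subset.
    node = root.find(f".//book[@id='{book_id}']/title")
    return None if node is None else node.text


print(titles(CATALOG))
print(title_for(CATALOG, "b2"))
```

Note that this still loads the whole tree into memory first, so it sits closer to the DOM end of the trade-off described above.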

If you want straightforward and packaged with Python, minidom is your answer, but it's pretty lame, and the documentation is "here's docs on dom, go figure it out". It's really annoying.
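
For a feel of what the DOM API looks like in practice, here is a minimal sketch that builds a tiny document with minidom. The element names and the function build_order_dom are made up for illustration.

```python
from xml.dom import minidom


def build_order_dom(order_id, items):
    """Build a small <order> document with minidom and return UTF-8 bytes."""
    doc = minidom.Document()
    root = doc.createElement("order")
    root.setAttribute("id", str(order_id))
    doc.appendChild(root)
    for name in items:
        # Every node is created through the document, then attached by hand.
        item = doc.createElement("item")
        item.appendChild(doc.createTextNode(name))
        root.appendChild(item)
    return doc.toxml(encoding="utf-8")


print(build_order_dom(7, ["widget", "gadget"]).decode("utf-8"))
```

It works, but the create-then-append ceremony for each node is the kind of thing that makes it feel heavier than it needs to be for simple files.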

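Personally, I like cElementTree, which is a faster (C-based) implementation of ElementTree, which is a DOM-like model. On Python 3.3 and later, plain xml.etree.ElementTree picks up the C accelerator automatically, and the separate cElementTree module was removed in Python 3.9.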

I've used SAX systems, and in many ways they're more "pythonic" in their feel, but I usually end up creating state-based systems to handle them, and that way lies madness (and bugs).
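
To show what that state tracking tends to look like, here is a minimal sketch with the standard library's xml.sax. The ItemCollector class, collect_items function and the <item> element are hypothetical names chosen for illustration.

```python
import xml.sax


class ItemCollector(xml.sax.ContentHandler):
    """Collect the text of every <item> element seen in the stream."""

    def __init__(self):
        super().__init__()
        self.items = []
        # Hand-rolled state: are we inside an <item>, and what text so far?
        self._in_item = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "item":
            self._in_item = True
            self._buffer = []

    def characters(self, content):
        # The parser may deliver one text node in several chunks.
        if self._in_item:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "item":
            self.items.append("".join(self._buffer))
            self._in_item = False


def collect_items(xml_text):
    """Parse the XML text with SAX and return the collected item strings."""
    handler = ItemCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.items


print(collect_items("<order><item>widget</item><item>gadget</item></order>"))
```

Even for a single tag there are two pieces of state to keep in sync across three callbacks, and that grows quickly once elements nest.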

I say go with minidom if you like research, or ElementTree if you want good code that works well.