Parsing very large XML documents (and a bit more) in Java

Solution 1:

StAX is the right way to go. I would recommend looking at Woodstox.

Solution 2:

This sounds like a job for StAX (JSR 173). StAX is a pull parser: like SAX, it processes the document as a stream of events instead of building a tree, but your code asks for the next event rather than receiving callbacks. That gives you more control over when to stop reading and which elements to pull.

The usability of this solution will depend a lot on what your extension classes are actually doing, whether you have control over their implementation, and so on.

The main point is that if the document is very large, you probably want an event-based parser rather than a tree-based one (such as DOM), so that the whole document never has to be held in memory at once.

Implementations of StAX are available from Sun (SJSXP), Codehaus, and a few other providers.
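
To make the pull model concrete, here is a minimal sketch using the standard javax.xml.stream API. The class name StaxPullSketch and the method firstElementText are made up for illustration, and it assumes the element you are after contains only text. The loop asks for one event at a time and returns as soon as it has what it needs, so the rest of the document is never processed.

```java
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, not part of any library.
class StaxPullSketch {

    /**
     * Returns the text of the first element whose local name is `name`,
     * or null if no such element exists. Assumes that element has
     * text-only content (getElementText() throws otherwise).
     */
    static String firstElementText(InputStream in, String name) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        try {
            // We drive the parser: each call to next() pulls exactly one event.
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && name.equals(reader.getLocalName())) {
                    // Found it: stop pulling events and return.
                    return reader.getElementText();
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }
}
```

Whether this fits your case still depends on the extension classes: if they expect a raw InputStream rather than parser events, you would need some adapter in between.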

Solution 3:

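You could use a BufferedInputStream with a very large buffer size, calling mark() before the extension class does its work and reset() afterwards. The read limit passed to mark() has to cover everything the extension class may read; otherwise reset() can fail with an IOException.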

If the parts the extension class needs are very far into the file, though, this might become extremely memory intensive.
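
A minimal sketch of the mark()/reset() idea, assuming the extension class can be handed the stream directly. MarkResetSketch and peek are hypothetical names, and the read loop stands in for whatever the extension class actually does with the stream.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
class MarkResetSketch {

    /**
     * Reads up to `length` bytes, then rewinds the stream to where it was.
     * `readLimit` must be at least as large as the number of bytes read,
     * otherwise reset() may throw an IOException.
     */
    static String peek(BufferedInputStream in, int length, int readLimit) throws IOException {
        in.mark(readLimit);               // remember the current position
        try {
            byte[] buf = new byte[length];
            int total = 0;
            while (total < length) {      // stand-in for the extension class's reads
                int n = in.read(buf, total, length - total);
                if (n < 0) {
                    break;                // end of stream
                }
                total += n;
            }
            return new String(buf, 0, total, StandardCharsets.UTF_8);
        } finally {
            in.reset();                   // rewind so the main parser sees the same bytes
        }
    }
}
```

Everything read between mark() and reset() is held in the buffer, which is where the memory cost comes from.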

A more general solution would be to write your own BufferedInputStream workalike that spills to disk once the buffered data exceeds some preset threshold.

Solution 4:

I would write a custom implementation of InputStream that decrypts the bytes in the file and then use SAX to parse the resulting XML as it comes off the stream.

SAXParserFactory.newInstance().newSAXParser().parse(
  new DecryptingInputStream(), 
  new MyHandler()
);
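
One possible shape for the decrypting stream, as a sketch only: it assumes the file is encrypted with AES in CBC mode with PKCS5 padding, which may not match your format, and it uses the JDK's CipherInputStream instead of a hand-written InputStream subclass. DecryptingSaxSketch, CountingHandler, and the helper methods are hypothetical names; the handler just counts elements in place of real processing.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch class, not part of any library.
class DecryptingSaxSketch {

    /** Stand-in for the real handler: counts start tags. */
    static class CountingHandler extends DefaultHandler {
        int elements = 0;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            elements++;
        }
    }

    /** Wraps an encrypted stream so that reads return decrypted bytes. */
    static InputStream decrypting(InputStream encrypted, byte[] key, byte[] iv) throws Exception {
        // Assumption: AES/CBC/PKCS5Padding; substitute your actual scheme.
        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
        return new CipherInputStream(encrypted, cipher);
    }

    /** Parses the XML as it is decrypted and returns the number of elements seen. */
    static int countElements(InputStream encrypted, byte[] key, byte[] iv) throws Exception {
        CountingHandler handler = new CountingHandler();
        try (InputStream in = decrypting(encrypted, key, iv)) {
            SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        }
        return handler.elements;
    }

    /** Demo-only helper that produces encrypted input for the round trip. */
    static byte[] encrypt(byte[] plain, byte[] key, byte[] iv) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
        return cipher.doFinal(plain);
    }

    public static void main(String[] args) throws Exception {
        // Fixed key and all-zero IV for demonstration only; never do this in production.
        byte[] key = "0123456789abcdef".getBytes(StandardCharsets.US_ASCII);
        byte[] iv = new byte[16];
        byte[] enc = encrypt("<root><a/><b/></root>".getBytes(StandardCharsets.UTF_8), key, iv);
        System.out.println(countElements(new ByteArrayInputStream(enc), key, iv));
    }
}
```

Because both decryption and parsing are streamed, memory use should stay roughly constant regardless of file size.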