Using StAX to create an index for XML for quick access

Solution 1:

You could work with an XML parser generated by ANTLR4.

The following works very well on a ~17 GB Wikipedia dump (/20170501/dewiki-20170501-pages-articles-multistream.xml.bz2), but I had to increase the heap size using -Xmx6g.
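
For example, when running the indexer built below (the classpath is just a placeholder):

java -Xmx6g -cp <your-classpath> stack43366566.FindXmlOffset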

1. Get XML Grammar

cd /tmp
git clone https://github.com/antlr/grammars-v4

2. Generate Parser

cd /tmp/grammars-v4/xml/
mvn clean install

3. Copy Generated Java files to your Project

cp -r target/generated-sources/antlr4 /path/to/your/project/gen

4. Hook in with a Listener to collect character offsets

package stack43366566;

import java.util.ArrayList;
import java.util.List;

import org.antlr.v4.runtime.ANTLRFileStream;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTreeWalker;

import stack43366566.gen.XMLLexer;
import stack43366566.gen.XMLParser;
import stack43366566.gen.XMLParser.DocumentContext;
import stack43366566.gen.XMLParserBaseListener;

public class FindXmlOffset {

    List<Integer> offsets = null;
    String searchForElement = null;

    // Listener that records the character offset of every matching start tag.
    public class MyXMLListener extends XMLParserBaseListener {
        @Override
        public void enterElement(XMLParser.ElementContext ctx) {
            String name = ctx.Name().get(0).getText();
            if (searchForElement.equals(name)) {
                // ctx.start is the '<' token that opens the element
                offsets.add(ctx.start.getStartIndex());
            }
        }
    }

    public List<Integer> createOffsets(String file, String elementName) {
        searchForElement = elementName;
        offsets = new ArrayList<>();
        try {
            // ANTLRFileStream loads the entire file into memory - hence the big heap.
            XMLLexer lexer = new XMLLexer(new ANTLRFileStream(file));
            CommonTokenStream tokens = new CommonTokenStream(lexer);
            XMLParser parser = new XMLParser(tokens);
            DocumentContext ctx = parser.document();
            ParseTreeWalker walker = new ParseTreeWalker();
            MyXMLListener listener = new MyXMLListener();
            walker.walk(listener, ctx);
            return offsets;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] arg) {
        System.out.println("Search for offsets.");
        List<Integer> offsets = new FindXmlOffset()
                .createOffsets("/tmp/dewiki-20170501-pages-articles-multistream.xml", "page");
        System.out.println("Offsets: " + offsets);
    }

}

5. Result

Prints:

Offsets: [2441, 10854, 30257, 51419 ....

6. Read from Offset Position

To test the code, I've written a class that reads each Wikipedia page into a Java object

@JacksonXmlRootElement
class Page {
    public Page() {} // Jackson needs a no-arg constructor
    public String title;
}

using basically this code

private Page readPage(Integer offset, String filename) {
    try (Reader in = new FileReader(filename)) {
        // skip() is char-based: it reads through the file up to the offset
        in.skip(offset);
        ObjectMapper mapper = new XmlMapper();
        // Page only maps <title>, so ignore every other field inside <page>
        mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
        return mapper.readValue(in, Page.class);
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}
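
Putting the pieces together, here is a minimal sketch of the lookup path. It assumes the classes above are on the classpath, treats readPage as static for brevity, and index 100 is an arbitrary page:

import java.util.List;

public class PageLookupDemo {
    public static void main(String[] args) {
        String file = "/tmp/dewiki-20170501-pages-articles-multistream.xml";
        // Build the index once - this is the expensive full ANTLR parse.
        List<Integer> offsets = new FindXmlOffset().createOffsets(file, "page");
        // Afterwards any page can be deserialized straight from its offset
        // without re-running the parse.
        Page page = readPage(offsets.get(100), file);
        System.out.println(page.title);
    }
}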

Find the complete example on GitHub.

Solution 2:

I just had to solve this problem, and spent way too much time figuring it out. Hopefully the next poor soul who comes looking for ideas can benefit from my suffering.

The first problem to contend with is that most XMLStreamReader implementations provide inaccurate results when you ask them for their current offsets. Woodstox, however, seems to be rock-solid in this regard.
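
For reference, collecting char offsets with plain StAX looks roughly like the sketch below. Whether the reported location points at the start or the end of the tag is implementation-defined, which is exactly the accuracy issue mentioned above, so verify against your parser. Unlike the ANTLR approach, this streams the file instead of loading it into memory.

import java.io.FileReader;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxOffsetIndexer {

    // Collects the char offset reported for each start tag with the given name.
    // Instantiating Woodstox directly avoids silently picking up another,
    // less accurate StAX implementation from the classpath.
    public static List<Integer> indexElement(String file, String element) throws Exception {
        XMLInputFactory factory = new com.ctc.wstx.stax.WstxInputFactory();
        List<Integer> offsets = new ArrayList<>();
        try (Reader in = new FileReader(file)) {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && element.equals(reader.getLocalName())) {
                    offsets.add(reader.getLocation().getCharacterOffset());
                }
            }
        }
        return offsets;
    }
}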

The second problem is the actual type of offset you use. You have to use char offsets if you need to work with a multi-byte charset, which means random-access retrieval from the file using those offsets is not going to be very efficient: you can't just set a pointer into the file at your offset and start reading; you have to read through until you get to the offset (that's what skip does under the covers in a Reader) and only then start extracting. If you're dealing with very large files, that means retrieval of content near the end of the file is too slow.
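
To make the cost difference concrete, a toy sketch of the two positioning strategies:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.Reader;

public class SeekCost {

    // Char offsets: the Reader has to decode every char before the target,
    // so positioning is linear in the offset (this loop is what skip() hides).
    static void positionByCharOffset(Reader in, long charOffset) throws IOException {
        long remaining = charOffset;
        while (remaining > 0) {
            long skipped = in.skip(remaining); // reads and decodes under the covers
            if (skipped <= 0) break; // EOF
            remaining -= skipped;
        }
    }

    // Byte offsets: seek() jumps straight there - O(1) anywhere in the file.
    static void positionByByteOffset(RandomAccessFile raf, long byteOffset) throws IOException {
        raf.seek(byteOffset);
    }
}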

I ended up writing a FilterReader that keeps a buffer of byte-offset-to-char-offset mappings as the file is read. When we need a byte offset, we first ask Woodstox for the char offset, then ask the custom reader for the actual byte offset it corresponds to. Since we can get the byte offsets of both the beginning and end of the element, we have everything we need to go in and surgically extract it from the file by opening it as a RandomAccessFile, which is super fast at any point in the file.
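
The sketch below shows the general shape of that idea; it is not the library's actual code. It assumes UTF-8 and naively keeps one map entry per char, which would be far too memory-hungry for a really big file (the real implementation is smarter about this), but it shows how a char offset translates back to a byte offset:

import java.io.FilterReader;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ByteTrackingReaderSketch extends FilterReader {

    // byteAt.get(i) = byte offset at which char i starts (UTF-8 assumed)
    private final List<Long> byteAt = new ArrayList<>();
    private long bytePos = 0;

    public ByteTrackingReaderSketch(Reader in) {
        super(in);
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        int n = super.read(cbuf, off, len);
        for (int i = 0; i < n; i++) {
            byteAt.add(bytePos);
            bytePos += utf8Length(cbuf[off + i]);
        }
        return n;
    }

    @Override
    public int read() throws IOException {
        char[] one = new char[1];
        return read(one, 0, 1) == -1 ? -1 : one[0];
    }

    @Override
    public long skip(long n) throws IOException {
        // Read instead of delegating, so the mapping stays complete.
        char[] buf = new char[8192];
        long done = 0;
        while (done < n) {
            int r = read(buf, 0, (int) Math.min(buf.length, n - done));
            if (r == -1) break;
            done += r;
        }
        return done;
    }

    // Only valid for char offsets that have already been read through.
    public long byteOffsetOf(int charOffset) {
        return byteAt.get(charOffset);
    }

    // UTF-8 length of the char's encoding; a surrogate pair's 4 bytes are
    // attributed to the high surrogate, so the low surrogate adds nothing.
    private static long utf8Length(char c) {
        if (c < 0x80) return 1;
        if (c < 0x800) return 2;
        if (Character.isHighSurrogate(c)) return 4;
        if (Character.isLowSurrogate(c)) return 0;
        return 3;
    }

    // Once start/end byte offsets are known, extraction is a direct seek.
    static String extract(String file, long startByte, long endByte) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            byte[] buf = new byte[(int) (endByte - startByte)];
            raf.seek(startByte);
            raf.readFully(buf);
            return new String(buf, StandardCharsets.UTF_8);
        }
    }
}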

I created a library for this; it's on GitHub and Maven Central. If you just want the important bits, the party trick is in the ByteTrackingReader.

Some people have commented that this whole thing is a bad idea, asking why you would want to do it: XML is a transport mechanism; you should just import it into a DB and work with the data using more appropriate tools. For most cases this is true, but if you're building applications or integrations that communicate via XML, you need tooling to analyze and operate on the files being exchanged. I get daily requests to verify feed contents; having the ability to quickly extract a specific set of items from a massive file and verify not only the contents but the format itself is essential.

Anyhow, hopefully this can save someone a few hours, or at least get them closer to a solution.