Parsing XML file containing HTML entities in Java without changing the XML

I have to parse a bunch of XML files in Java that sometimes -- and invalidly -- contain HTML entities such as —, > and so forth. I understand the correct way of dealing with this is to add suitable entity declarations to the XML file before parsing. However, I can't do that as I have no control over those XML files.

Is there some kind of callback I can override that is invoked whenever the Java XML parser encounters such an entity? I haven't been able to find one in the API.

I'd like to use:

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();

DocumentBuilder parser = dbf.newDocumentBuilder();
Document        doc    = parser.parse( stream );

I found that I can override resolveEntity in org.xml.sax.helpers.DefaultHandler, but how do I use this with the higher-level API?

Here's a full example:

public class Main {
    public static void main( String [] args ) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        DocumentBuilder parser = dbf.newDocumentBuilder();
        Document        doc    = parser.parse( new FileInputStream( "test.xml" ));
    }

}

with test.xml:

<?xml version="1.0" encoding="UTF-8"?>
<foo>
    <bar>Some&nbsp;text &mdash; invalid!</bar>
</foo>

Produces:

[Fatal Error] :3:20: The entity "nbsp" was referenced, but not declared.
Exception in thread "main" org.xml.sax.SAXParseException; lineNumber: 3; columnNumber: 20; The entity "nbsp" was referenced, but not declared.

Update: I have been poking around in the JDK source code with a debugger, and boy, what an amount of spaghetti. I have no idea what the design is there, or whether there is one. Just how many layers of an onion can one layer on top of each other?

They key class seems to be com.sun.org.apache.xerces.internal.impl.XMLEntityManager, but I cannot find any code that either lets me add stuff into it before it gets used, or that attempts to resolve entities without going through that class.

Solution 1:

I would use a library like Jsoup for this purpose. I tested the following below and it works. I don't know if this helps. It can be located here: http://jsoup.org/download

public static void main(String args[]){


    String html = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><foo>" + 
                  "<bar>Some&nbsp;text &mdash; invalid!</bar></foo>";
    Document doc = Jsoup.parse(html, "", Parser.xmlParser());

    for (Element e : doc.select("bar")) {
        System.out.println(e);
    }   


}

Result:

<bar>
 Some&nbsp;text — invalid!
</bar>

Loading from a file can be found here:

http://jsoup.org/cookbook/input/load-document-from-file

Solution 2:

Issue - 1: I have to parse a bunch of XML files in Java that sometimes -- and invalidly -- contain HTML entities such as —

XML has only five predefined entities. The —,   is not among them. It works only when used in plain HTML or in legacy JSP. So, SAX will not help. It can be done using StaX which has high level iterator based API. (Collected from this link)

Issue - 2: I found that I can override resolveEntity in org.xml.sax.helpers.DefaultHandler, but how do I use this with the higher-level API?

Streaming API for XML, called StaX, is an API for reading and writing XML Documents.

StaX is a Pull-Parsing model. Application can take the control over parsing the XML documents by pulling (taking) the events from the parser.

The core StaX API falls into two categories and they are listed below. They are

Cursor based API: It is low-level API. cursor-based API allows the application to process XML as a stream of tokens aka events
Iterator based API: The higher-level iterator-based API allows the application to process XML as a series of event objects, each of which communicates a piece of the XML structure to the application.

STaX API has support for the notion of not replacing character entity references, by way of the IS_REPLACING_ENTITY_REFERENCES property:

Requires the parser to replace internal entity references with their replacement text and report them as characters

This can be set into an XmlInputFactory, which is then in turn used to construct an XmlEventReader or XmlStreamReader.

However, the API is careful to say that this property is only intended to force the implementation to perform the replacement, rather than forcing it to notreplace them.

You may try it. Hope it will solve your issue. For your case,

Main.java

import java.io.FileInputStream;
import java.io.FileNotFoundException;

import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.EntityReference;
import javax.xml.stream.events.XMLEvent;

public class Main {

    public static void main(String[] args) {
        XMLInputFactory inputFactory = XMLInputFactory.newInstance();
        inputFactory.setProperty(
                XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, false);
        XMLEventReader reader;
        try {
            reader = inputFactory
                    .createXMLEventReader(new FileInputStream("F://test.xml"));
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isEntityReference()) {
                    EntityReference ref = (EntityReference) event;
                    System.out.println("Entity Reference: " + ref.getName());
                }
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (XMLStreamException e) {
            e.printStackTrace();
        }
    }
}

test.xml:

<?xml version="1.0" encoding="UTF-8"?>
<foo>
    <bar>Some&nbsp;text &mdash; invalid!</bar>
</foo>

Output:

Entity Reference: nbsp

Entity Reference: mdash

Credit goes to @skaffman.

Related Link:

http://www.journaldev.com/1191/how-to-read-xml-file-in-java-using-java-stax-api
http://www.journaldev.com/1226/java-stax-cursor-based-api-read-xml-example
http://www.vogella.com/tutorials/JavaXML/article.html
Is there a Java XML API that can parse a document without resolving character entities?

UPDATE:

Issue - 3: Is there a way to use StaX to "filter" the entities (replacing them with something else, for example) and still produce a Document at the end of the process?

To create a new document using the StAX API, it is required to create an XMLStreamWriter that provides methods to produce XML opening and closing tags, attributes and character content.

There are 5 methods of XMLStreamWriter for document.

xmlsw.writeStartDocument(); - initialises an empty document to which elements can be added
xmlsw.writeStartElement(String s) -creates a new element named s
xmlsw.writeAttribute(String name, String value)- adds the attribute name with the corresponding value to the last element produced by a call to writeStartElement. It is possible to add attributes as long as no call to writeElementStart,writeCharacters or writeEndElement has been done.
xmlsw.writeEndElement - close the last started element
xmlsw.writeCharacters(String s) - creates a new text node with content s as content of the last started element.

A sample example is attached with it:

StAXExpand.java

import  java.io.BufferedReader;
import  java.io.FileReader;
import  java.io.IOException;

import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

import java.util.Arrays;

public class StAXExpand {   
    static XMLStreamWriter xmlsw = null;
    public static void main(String[] argv) {
        try {
            xmlsw = XMLOutputFactory.newInstance()
                          .createXMLStreamWriter(System.out);
            CompactTokenizer tok = new CompactTokenizer(
                          new FileReader(argv[0]));

            String rootName = "dummyRoot";
            // ignore everything preceding the word before the first "["
            while(!tok.nextToken().equals("[")){
                rootName=tok.getToken();
            }
            // start creating new document
            xmlsw.writeStartDocument();
            ignorableSpacing(0);
            xmlsw.writeStartElement(rootName);
            expand(tok,3);
            ignorableSpacing(0);
            xmlsw.writeEndDocument();

            xmlsw.flush();
            xmlsw.close();
        } catch (XMLStreamException e){
            System.out.println(e.getMessage());
        } catch (IOException ex) {
            System.out.println("IOException"+ex);
            ex.printStackTrace();
        }
    }

    public static void expand(CompactTokenizer tok, int indent) 
        throws IOException,XMLStreamException {
        tok.skip("["); 
        while(tok.getToken().equals("@")) {// add attributes
            String attName = tok.nextToken();
            tok.nextToken();
            xmlsw.writeAttribute(attName,tok.skip("["));
            tok.nextToken();
            tok.skip("]");
        }
        boolean lastWasElement=true; // for controlling the output of newlines 
        while(!tok.getToken().equals("]")){ // process content 
            String s = tok.getToken().trim();
            tok.nextToken();
            if(tok.getToken().equals("[")){
                if(lastWasElement)ignorableSpacing(indent);
                xmlsw.writeStartElement(s);
                expand(tok,indent+3);
                lastWasElement=true;
            } else {
                xmlsw.writeCharacters(s);
                lastWasElement=false;
            }
        }
        tok.skip("]");
        if(lastWasElement)ignorableSpacing(indent-3);
        xmlsw.writeEndElement();
   }

    private static char[] blanks = "\n".toCharArray();
    private static void ignorableSpacing(int nb) 
        throws XMLStreamException {
        if(nb>blanks.length){// extend the length of space array 
            blanks = new char[nb+1];
            blanks[0]='\n';
            Arrays.fill(blanks,1,blanks.length,' ');
        }
        xmlsw.writeCharacters(blanks, 0, nb+1);
    }

}

CompactTokenizer.java

import  java.io.Reader;
import  java.io.IOException;
import  java.io.StreamTokenizer;

public class CompactTokenizer {
    private StreamTokenizer st;

    CompactTokenizer(Reader r){
        st = new StreamTokenizer(r);
        st.resetSyntax(); // remove parsing of numbers...
        st.wordChars('\u0000','\u00FF'); // everything is part of a word
                                         // except the following...
        st.ordinaryChar('\n');
        st.ordinaryChar('[');
        st.ordinaryChar(']');
        st.ordinaryChar('@');
    }

    public String nextToken() throws IOException{
        st.nextToken();
        while(st.ttype=='\n'|| 
              (st.ttype==StreamTokenizer.TT_WORD && 
               st.sval.trim().length()==0))
            st.nextToken();
        return getToken();
    }

    public String getToken(){
        return (st.ttype == StreamTokenizer.TT_WORD) ? st.sval : (""+(char)st.ttype);
    }

    public String skip(String sym) throws IOException {
        if(getToken().equals(sym))
            return nextToken();
        else
            throw new IllegalArgumentException("skip: "+sym+" expected but"+ 
                                               sym +" found ");
    }
}

For more, you can follow the tutorial

https://docs.oracle.com/javase/tutorial/jaxp/stax/example.html
http://www.ibm.com/developerworks/library/x-tipstx2/index.html
http://www.iro.umontreal.ca/~lapalme/ForestInsteadOfTheTrees/HTML/ch09s03.html
http://staf.sourceforge.net/current/STAXDoc.pdf