How to unescape XML in Java
I need to unescape an XML string containing escaped XML tags:
&lt;
&gt;
&amp;
etc...
I did find some libraries that can do this, but I'd rather use a single method for it.
Can someone help?
cheers, Bas Hendriks
Solution 1:
StringEscapeUtils.unescapeXml(xml)
(from Apache Commons Lang)
Solution 2:
Here's a simple method to unescape XML. It handles the predefined XML entities and decimal numerical entities (&#nnnn;). Modifying it to handle hex entities (&#xhhhh;) should be simple.
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static String unescapeXML( final String xml )
{
    //the name may not contain '&', so a stray ampersand cannot swallow a later entity
    Pattern xmlEntityRegex = Pattern.compile( "&(#?)([^;&]+);" );
    //Unfortunately, Matcher requires a StringBuffer instead of a StringBuilder (before Java 9)
    StringBuffer unescapedOutput = new StringBuffer( xml.length() );
    Matcher m = xmlEntityRegex.matcher( xml );
    Map<String,String> builtinEntities = null;
    String entity;
    String hashmark;
    String ent;

    while ( m.find() ) {
        ent = m.group(2);
        hashmark = m.group(1);
        if ( (hashmark != null) && (hashmark.length() > 0) ) {
            try {
                //decimal entity; toChars also covers code points above U+FFFF
                entity = new String( Character.toChars( Integer.parseInt( ent ) ) );
            } catch ( IllegalArgumentException e ) {
                //not a valid decimal code point (this includes hex entities) - leave it as is
                entity = m.group();
            }
        } else {
            //must be a non-numerical entity
            if ( builtinEntities == null ) {
                builtinEntities = buildBuiltinXMLEntityMap();
            }
            entity = builtinEntities.get( ent );
            if ( entity == null ) {
                //not a known entity - leave it as is
                entity = m.group();
            }
        }
        //quoteReplacement keeps '$' and '\' in the result from being read as group references
        m.appendReplacement( unescapedOutput, Matcher.quoteReplacement( entity ) );
    }
    m.appendTail( unescapedOutput );

    return unescapedOutput.toString();
}

private static Map<String,String> buildBuiltinXMLEntityMap()
{
    Map<String,String> entities = new HashMap<String,String>(10);
    entities.put( "lt", "<" );
    entities.put( "gt", ">" );
    entities.put( "amp", "&" );
    entities.put( "apos", "'" );
    entities.put( "quot", "\"" );
    return entities;
}
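For the hex entities mentioned above, one possible extension is sketched below. It is not the answerer's code: the class name XmlEntityDecoder and the helper decode are made up for illustration, and the regex is an assumption that only accepts a lowercase "x" prefix, as XML does. Anything unknown or malformed is copied through unchanged.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this sketch.
final class XmlEntityDecoder {
    // group(1) is null for named entities, "#" for decimal, "#x" for hex references.
    private static final Pattern ENTITY = Pattern.compile("&(#x?)?([A-Za-z0-9]+);");

    public static String unescapeXml(String xml) {
        Matcher m = ENTITY.matcher(xml);
        StringBuffer out = new StringBuffer(xml.length());
        while (m.find()) {
            String replacement = decode(m.group(1), m.group(2));
            if (replacement == null) {
                replacement = m.group(); // unknown or malformed: keep as written
            }
            // quoteReplacement stops '$' and '\' from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Returns the decoded text, or null if the reference is not recognised.
    private static String decode(String prefix, String body) {
        if (prefix == null) {
            switch (body) {
                case "lt":   return "<";
                case "gt":   return ">";
                case "amp":  return "&";
                case "apos": return "'";
                case "quot": return "\"";
                default:     return null;
            }
        }
        try {
            int radix = prefix.equals("#x") ? 16 : 10;
            int codePoint = Integer.parseInt(body, radix);
            if (!Character.isValidCodePoint(codePoint)) {
                return null;
            }
            // toChars yields a surrogate pair for code points above U+FFFF.
            return new String(Character.toChars(codePoint));
        } catch (NumberFormatException e) {
            return null; // e.g. "&#xyz;" or a number too large for an int
        }
    }
}
```

With this sketch, XmlEntityDecoder.unescapeXml("&#65;&#x42;") should give "AB". It does not check whether the code point is actually allowed in XML (for example, &#0; is decoded rather than rejected).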
Solution 3:
Here is one that I wrote in ten minutes. It does not use regular expressions, only simple iterations. I do not think that this can be enhanced to be much faster.
public static String unescape(final String text) {
    StringBuilder result = new StringBuilder(text.length());
    int i = 0;
    int n = text.length();
    while (i < n) {
        char charAt = text.charAt(i);
        if (charAt != '&') {
            result.append(charAt);
            i++;
        } else {
            // skip past the whole entity: the increment is the entity's length
            if (text.startsWith("&amp;", i)) {
                result.append('&');
                i += 5;
            } else if (text.startsWith("&apos;", i)) {
                result.append('\'');
                i += 6;
            } else if (text.startsWith("&quot;", i)) {
                result.append('"');
                i += 6;
            } else if (text.startsWith("&lt;", i)) {
                result.append('<');
                i += 4;
            } else if (text.startsWith("&gt;", i)) {
                result.append('>');
                i += 4;
            } else {
                // not one of the five predefined entities: keep the '&' as is
                result.append(charAt);
                i++;
            }
        }
    }
    return result.toString();
}
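A side note on the iteration approach: scanning once from left to right is not only about speed. Text produced by a replacement is never looked at again, so doubly escaped input is unescaped exactly one level. The sketch below illustrates this with a table-driven variant of the loop next to a chain of String.replace calls; the class and method names are made up for the comparison and are not from the answer above.

```java
// Hypothetical class, written only to compare the two approaches.
final class SinglePassDemo {
    private static final String[][] ENTITIES = {
        {"&amp;", "&"}, {"&lt;", "<"}, {"&gt;", ">"}, {"&apos;", "'"}, {"&quot;", "\""}
    };

    // Scans once; output of a replacement is never examined again.
    static String unescapeSinglePass(String text) {
        StringBuilder result = new StringBuilder(text.length());
        int i = 0;
        outer:
        while (i < text.length()) {
            if (text.charAt(i) == '&') {
                for (String[] e : ENTITIES) {
                    if (text.startsWith(e[0], i)) {
                        result.append(e[1]);
                        i += e[0].length();
                        continue outer;
                    }
                }
            }
            // ordinary character, or an '&' that starts no known entity
            result.append(text.charAt(i));
            i++;
        }
        return result.toString();
    }

    // Naive alternative: later replace calls also see the output of earlier ones.
    static String unescapeChained(String text) {
        return text.replace("&amp;", "&").replace("&lt;", "<").replace("&gt;", ">")
                   .replace("&apos;", "'").replace("&quot;", "\"");
    }

    public static void main(String[] args) {
        String escapedTwice = "&amp;lt;b&amp;gt;";
        System.out.println(unescapeSinglePass(escapedTwice)); // prints &lt;b&gt;
        System.out.println(unescapeChained(escapedTwice));    // prints <b>
    }
}
```

The chained version would behave the same as the single pass if &amp; were replaced last, but the single pass does not depend on getting that order right.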