How do you remove invalid hexadecimal characters from an XML-based data source prior to constructing an XmlReader or XPathDocument that uses the data?
It may not be perfect (emphasis added since people missing this disclaimer), but what I've done in that case is below. You can adjust to use with a stream.
/// <summary>
/// Removes control characters and other non-UTF-8 characters
/// </summary>
/// <param name="inString">The string to process</param>
/// <returns>A string with no control characters or entities above 0x00FD</returns>
public static string RemoveTroublesomeCharacters(string inString)
{
if (inString == null) return null;
StringBuilder newString = new StringBuilder();
char ch;
for (int i = 0; i < inString.Length; i++)
{
ch = inString[i];
// remove any characters outside the valid UTF-8 range as well as all control characters
// except tabs and new lines
//if ((ch < 0x00FD && ch > 0x001F) || ch == '\t' || ch == '\n' || ch == '\r')
//if using .NET version prior to 4, use above logic
if (XmlConvert.IsXmlChar(ch)) //this method is new in .NET 4
{
newString.Append(ch);
}
}
return newString.ToString();
}
I like Eugene's whitelist concept. I needed to do a similar thing as the original poster, but I needed to support all Unicode characters, not just up to 0x00FD. The XML spec is:
Char = #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
In .NET, the internal representation of Unicode characters is only 16 bits, so we can't `allow' 0x10000-0x10FFFF explicitly. The XML spec explicitly disallows the surrogate code points starting at 0xD800 from appearing. However it is possible that if we allowed these surrogate code points in our whitelist, utf-8 encoding our string might produce valid XML in the end as long as proper utf-8 encoding was produced from the surrogate pairs of utf-16 characters in the .NET string. I haven't explored this though, so I went with the safer bet and didn't allow the surrogates in my whitelist.
The comments in Eugene's solution are misleading though, the problem is that the characters we are excluding are not valid in XML ... they are perfectly valid Unicode code points. We are not removing `non-utf-8 characters'. We are removing utf-8 characters that may not appear in well-formed XML documents.
public static string XmlCharacterWhitelist( string in_string ) {
if( in_string == null ) return null;
StringBuilder sbOutput = new StringBuilder();
char ch;
for( int i = 0; i < in_string.Length; i++ ) {
ch = in_string[i];
if( ( ch >= 0x0020 && ch <= 0xD7FF ) ||
( ch >= 0xE000 && ch <= 0xFFFD ) ||
ch == 0x0009 ||
ch == 0x000A ||
ch == 0x000D ) {
sbOutput.Append( ch );
}
}
return sbOutput.ToString();
}