Escape invalid XML characters in C#
I have a string that contains invalid XML characters. How can I escape (or remove) invalid XML characters before I parse the string?
Solution 1:
To remove invalid XML characters, I suggest using the XmlConvert.IsXmlChar method. It was added in .NET Framework 4 and is also available in Silverlight. Here is a small sample:
// Requires: using System; using System.Linq; using System.Xml;
void Main() {
    string content = "\v\f\0";
    Console.WriteLine(IsValidXmlString(content)); // False
    content = RemoveInvalidXmlChars(content);
    Console.WriteLine(IsValidXmlString(content)); // True
}
// Keeps only the chars that XmlConvert.IsXmlChar accepts.
static string RemoveInvalidXmlChars(string text) {
    var validXmlChars = text.Where(ch => XmlConvert.IsXmlChar(ch)).ToArray();
    return new string(validXmlChars);
}
// VerifyXmlChars throws if the text contains an invalid XML character.
static bool IsValidXmlString(string text) {
    try {
        XmlConvert.VerifyXmlChars(text);
        return true;
    } catch {
        return false;
    }
}
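To escape invalid XML characters, I suggest using the XmlConvert.EncodeName method. Keep in mind that EncodeName is designed for XML names, so it also encodes characters such as spaces that are valid in XML text. Here is a small sample: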
void Main() {
    const string content = "\v\f\0";
    Console.WriteLine(IsValidXmlString(content)); // False
    string encoded = XmlConvert.EncodeName(content);
    Console.WriteLine(IsValidXmlString(encoded)); // True
    string decoded = XmlConvert.DecodeName(encoded);
    Console.WriteLine(content == decoded); // True
}
static bool IsValidXmlString(string text) {
    try {
        XmlConvert.VerifyXmlChars(text);
        return true;
    } catch {
        return false;
    }
}
Update: Note that the encoded string is always at least as long as the source string. This matters when you store the encoded string in a database column with a length limit but validate the length of the source string in your app.
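To make the length growth concrete, here is a minimal sketch that compares lengths before and after encoding. It relies on the _xHHHH_ escape format that XmlConvert.EncodeName uses for characters it cannot keep; the class name EncodeNameLengthDemo and the helper EncodingOverhead are made up for this illustration.

```csharp
using System;
using System.Xml;

static class EncodeNameLengthDemo
{
    // Hypothetical helper: how many characters EncodeName adds to the text.
    public static int EncodingOverhead(string text)
    {
        string encoded = XmlConvert.EncodeName(text);
        return encoded.Length - text.Length;
    }

    static void Main()
    {
        const string content = "\v\f\0";
        string encoded = XmlConvert.EncodeName(content);

        Console.WriteLine(content.Length);          // 3
        Console.WriteLine(encoded.Length);          // 21 (each char becomes a 7-char _xHHHH_ escape)
        Console.WriteLine(EncodingOverhead("abc")); // 0 (already a valid name, left unchanged)
    }
}
```

If the column limit applies to the stored value, validating encoded.Length rather than the source length avoids the mismatch described above.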
Solution 2:
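Use SecurityElement.Escape, which replaces the five XML markup characters (less-than, greater-than, ampersand, double quote, single quote) with their entity references: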
using System;
using System.Security;
class Sample {
    static void Main() {
        string text = "Escape characters : < > & \" \'";
        string xmlText = SecurityElement.Escape(text);
        //output:
        //Escape characters : &lt; &gt; &amp; &quot; &apos;
        Console.WriteLine(xmlText);
    }
}
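SecurityElement.Escape only rewrites markup characters; it does not touch characters such as "\0" that XML 1.0 forbids outright. A minimal sketch, assuming you want both behaviours, is to filter with XmlConvert.IsXmlChar first and escape afterwards. The class EscapeDemo and the method SanitizeAndEscape are hypothetical names used only for this example.

```csharp
using System;
using System.Linq;
using System.Security;
using System.Xml;

static class EscapeDemo
{
    // Hypothetical helper: drops chars that are not valid XML characters,
    // then escapes the markup characters in what is left.
    // Caveat: the per-char filter also drops surrogate pairs (see Solution 4).
    public static string SanitizeAndEscape(string text)
    {
        if (string.IsNullOrEmpty(text))
            return text;

        string cleaned = new string(text.Where(ch => XmlConvert.IsXmlChar(ch)).ToArray());
        return SecurityElement.Escape(cleaned);
    }

    static void Main()
    {
        Console.WriteLine(SanitizeAndEscape("a < b\0")); // a &lt; b
    }
}
```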
Solution 3:
If you are writing XML, just use the classes provided by the framework to create it. You won't have to bother with escaping or anything.
Console.Write(new XElement("Data", "< > &"));
This outputs:
<Data>&lt; &gt; &amp;</Data>
If you need to read an XML file that is malformed, do not use regular expressions. Instead, use the Html Agility Pack.
Solution 4:
The RemoveInvalidXmlChars method provided by Irishman (the LINQ-based version in Solution 1) does not support surrogate characters. To test it, use the following example:
static void Main()
{
    const string content = "\v\U00010330";
    string newContent = RemoveInvalidXmlChars(content);
    Console.WriteLine(newContent);
}
This returns an empty string but it shouldn't! It should return "\U00010330" because the character U+10330 is a valid XML character.
To support surrogate characters, I suggest using the following method:
public static string RemoveInvalidXmlChars(string text)
{
    if (string.IsNullOrEmpty(text))
        return text;
    int length = text.Length;
    StringBuilder stringBuilder = new StringBuilder(length);
    for (int i = 0; i < length; ++i)
    {
        if (XmlConvert.IsXmlChar(text[i]))
        {
            stringBuilder.Append(text[i]);
        }
        // IsXmlSurrogatePair takes the low surrogate first, then the high surrogate
        else if (i + 1 < length && XmlConvert.IsXmlSurrogatePair(text[i + 1], text[i]))
        {
            stringBuilder.Append(text[i]);
            stringBuilder.Append(text[i + 1]);
            ++i;
        }
    }
    return stringBuilder.ToString();
}
Solution 5:
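Here is an optimized version of the RemoveInvalidXmlChars method above. It does not allocate a new array on every call (which stresses the GC unnecessarily) and allocates nothing at all when the input is already valid. Because it checks each char with XmlConvert.IsXmlChar only, it drops surrogate pairs, just as the version in Solution 1 does: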
public static string RemoveInvalidXmlChars(string text)
{
    if (text == null)
        return text;
    if (text.Length == 0)
        return text;
    // a bit complicated, but avoids memory usage if not necessary
    StringBuilder result = null;
    for (int i = 0; i < text.Length; i++)
    {
        var ch = text[i];
        if (XmlConvert.IsXmlChar(ch))
        {
            result?.Append(ch);
        }
        else if (result == null)
        {
            // first invalid char: copy the valid prefix without an extra Substring allocation
            result = new StringBuilder(text.Length);
            result.Append(text, 0, i);
        }
    }
    if (result == null)
        return text; // no invalid xml chars detected - return original text
    else
        return result.ToString();
}
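The lazy allocation above and the surrogate handling from Solution 4 can be combined. The following is a sketch under that assumption, not a drop-in replacement from any of the answers; the class name XmlSanitizer is made up for the example.

```csharp
using System;
using System.Text;
using System.Xml;

static class XmlSanitizer
{
    public static string RemoveInvalidXmlChars(string text)
    {
        if (string.IsNullOrEmpty(text))
            return text;

        StringBuilder result = null; // created only once something has to be dropped
        int i = 0;
        while (i < text.Length)
        {
            char ch = text[i];

            // Number of chars to keep at position i: 1 for a valid char,
            // 2 for a valid surrogate pair, 0 if the char must be dropped.
            int keep;
            if (XmlConvert.IsXmlChar(ch))
                keep = 1;
            else if (i + 1 < text.Length && XmlConvert.IsXmlSurrogatePair(text[i + 1], ch))
                keep = 2;
            else
                keep = 0;

            if (keep > 0)
            {
                result?.Append(text, i, keep);
                i += keep;
            }
            else
            {
                if (result == null)
                {
                    // first invalid char: copy the valid prefix seen so far
                    result = new StringBuilder(text.Length);
                    result.Append(text, 0, i);
                }
                i++;
            }
        }

        // no invalid chars found: return the original instance
        return result == null ? text : result.ToString();
    }

    static void Main()
    {
        Console.WriteLine(RemoveInvalidXmlChars("a\0b"));                           // ab
        Console.WriteLine(RemoveInvalidXmlChars("\v\U00010330") == "\U00010330");   // True
    }
}
```

When the input contains nothing to remove, the method returns the same string instance it was given, so callers can use ReferenceEquals to detect whether anything was stripped.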