Removing hidden characters from within strings

My problem:

I have a .NET application that sends out newsletters by email. When a newsletter is viewed in Outlook, Outlook displays a question mark in place of a hidden character it can't recognize. The hidden characters come from end users who copy and paste the HTML that makes up the newsletters into a form and submit it. C#'s Trim() removes these characters if they occur at the beginning or end of the string. When the newsletter is viewed in Gmail, Gmail does a good job of ignoring them. When I paste these hidden characters into a Word document and turn on the “show paragraph marks and hidden symbols” option, they appear as one rectangle inside a bigger rectangle. The text of the newsletters can be in any language, so accepting Unicode characters is a must. I've tried looping through the string to detect the character, but the loop doesn't recognize it and passes over it. Asking the end users to paste the HTML into Notepad before submitting it is out of the question.

My question:
How can I detect and eliminate these hidden characters using C#?


You can remove all control characters from your input string with something like this:

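// Requires: using System.Linq;
// Note: char.IsControl is also true for tab, line feed and carriage return.
string input; // this is your input string
string output = new string(input.Where(c => !char.IsControl(c)).ToArray());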

Here is the documentation for the IsControl() method.
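If the stray characters turn out to be format characters such as U+200B (zero width space) or U+FEFF (byte order mark), char.IsControl will not catch them, because it only reports the Cc (control) category. A minimal sketch that also drops the Cf (format) category while keeping tab, line feed and carriage return could look like the following. The class and method names are hypothetical, and whether the pasted characters really are Cf characters is an assumption that should be checked against real input:

```csharp
using System;
using System.Globalization;
using System.Linq;

public static class HiddenCharFilter
{
    // Hypothetical helper: drops control (Cc) and format (Cf) characters,
    // but keeps tab, line feed and carriage return so the HTML layout survives.
    public static string StripHidden(string input)
    {
        return new string(input.Where(c =>
        {
            if (c == '\t' || c == '\n' || c == '\r') return true;
            UnicodeCategory cat = char.GetUnicodeCategory(c);
            return cat != UnicodeCategory.Control && cat != UnicodeCategory.Format;
        }).ToArray());
    }
}
```

Letters, digits, punctuation and symbols from any script pass through unchanged. Be aware that Cf also contains the zero width joiner and non-joiner (U+200D, U+200C), which carry meaning in some scripts and in emoji sequences, so you may want to whitelist those as well.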

Or, if you want to keep letters and digits only, you can use the IsLetter and IsDigit methods instead. Note that this also removes spaces, punctuation and HTML markup:

string output = new string(input.Where(c => char.IsLetter(c) || char.IsDigit(c)).ToArray());

I usually use this regular expression to replace all non-printable characters.

By the way, most people consider tab, line feed and carriage return to be non-printable characters, but for me they are not.

So here is the expression:

string output = Regex.Replace(input, @"[^\u0009\u000A\u000D\u0020-\u007E]", "*");
  • ^ at the start of the character class negates it, so the pattern matches any character that is not one of the following:
  • \u0009 is tab
  • \u000A is linefeed
  • \u000D is carriage return
  • \u0020-\u007E means everything from space to ~, that is, every printable ASCII character.

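See an ASCII table if you want to make changes. Remember that this replaces every non-ASCII character with *, including accented and non-Latin letters; pass an empty string as the replacement to remove the characters instead.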
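Since the question requires text in any language to survive, a Unicode-aware variant of the same idea may fit better than an ASCII whitelist. The sketch below uses .NET's Unicode category classes and character class subtraction to match control (Cc) and format (Cf) characters except tab, line feed and carriage return. The class and method names are hypothetical, and the choice of categories is an assumption about what the hidden characters are:

```csharp
using System.Text.RegularExpressions;

public static class HiddenCharRegex
{
    // Matches control (Cc) and format (Cf) characters, minus tab, LF and CR.
    // "-[...]" at the end of a character class is .NET's subtraction syntax.
    private static readonly Regex Hidden = new Regex(@"[\p{Cc}\p{Cf}-[\t\n\r]]");

    // Removes the matched characters and leaves everything else untouched.
    public static string Clean(string input)
    {
        return Hidden.Replace(input, string.Empty);
    }
}
```

Used as string output = HiddenCharRegex.Clean(input); this leaves accented, Cyrillic, CJK and other non-ASCII letters in place.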

To test the above, you can build a string yourself like this:

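    // Build a string containing the characters U+0000 through U+00FF.
    var sb = new System.Text.StringBuilder();

    for (int i = 0; i < 256; i++)
    {
        sb.Append((char)i);
    }

    string input = sb.ToString();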
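
Besides a synthetic test string, it can help to find out exactly which characters the users are pasting before deciding what to strip. The following is a rough diagnostic sketch (the class and method names are hypothetical) that reports the index, code point and Unicode category of every character outside printable ASCII, other than tab, line feed and carriage return:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class HiddenCharReport
{
    // Hypothetical helper: describes each character that is not printable
    // ASCII, tab, LF or CR as "index: U+XXXX Category".
    public static List<string> Describe(string input)
    {
        var lines = new List<string>();
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            bool plain = c == '\t' || c == '\n' || c == '\r' || (c >= ' ' && c <= '~');
            if (!plain)
            {
                UnicodeCategory cat = char.GetUnicodeCategory(c);
                lines.Add(string.Format("{0}: U+{1:X4} {2}", i, (int)c, cat));
            }
        }
        return lines;
    }
}
```

Running this over a newsletter that shows the question mark in Outlook should reveal the offending code points. Legitimate non-ASCII letters will be listed too, so look for entries in the Control or Format categories.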