Most efficient way to remove special characters from string

I want to remove all special characters from a string. Allowed characters are A-Z (uppercase or lowercase), numbers (0-9), underscore (_), or the dot sign (.).

I have the following, it works but I suspect (I know!) it's not very efficient:

    public static string RemoveSpecialCharacters(string str)
    {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < str.Length; i++)
        {
            if ((str[i] >= '0' && str[i] <= '9')
                || (str[i] >= 'A' && str[i] <= 'z'
                    || (str[i] == '.' || str[i] == '_')))
                {
                    sb.Append(str[i]);
                }
        }

        return sb.ToString();
    }

What is the most efficient way to do this? What would a regular expression look like, and how does it compare with normal string manipulation?

The strings that will be cleaned will be rather short, usually between 10 and 30 characters in length.


Solution 1:

Why do you think that your method is not efficient? It's actually one of the most efficient ways that you can do it.

You should of course read the character into a local variable or use an enumerator to reduce the number of array accesses:

public static string RemoveSpecialCharacters(this string str) {
   StringBuilder sb = new StringBuilder();
   foreach (char c in str) {
      if ((c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '.' || c == '_') {
         sb.Append(c);
      }
   }
   return sb.ToString();
}

One thing that makes a method like this efficient is that it scales well. The execution time will be relative to the length of the string. There is no nasty surprises if you would use it on a large string.

Edit:
I made a quick performance test, running each function a million times with a 24 character string. These are the results:

Original function: 54.5 ms.
My suggested change: 47.1 ms.
Mine with setting StringBuilder capacity: 43.3 ms.
Regular expression: 294.4 ms.

Edit 2: I added the distinction between A-Z and a-z in the code above. (I reran the performance test, and there is no noticable difference.)

Edit 3:
I tested the lookup+char[] solution, and it runs in about 13 ms.

The price to pay is, of course, the initialization of the huge lookup table and keeping it in memory. Well, it's not that much data, but it's much for such a trivial function...

private static bool[] _lookup;

static Program() {
   _lookup = new bool[65536];
   for (char c = '0'; c <= '9'; c++) _lookup[c] = true;
   for (char c = 'A'; c <= 'Z'; c++) _lookup[c] = true;
   for (char c = 'a'; c <= 'z'; c++) _lookup[c] = true;
   _lookup['.'] = true;
   _lookup['_'] = true;
}

public static string RemoveSpecialCharacters(string str) {
   char[] buffer = new char[str.Length];
   int index = 0;
   foreach (char c in str) {
      if (_lookup[c]) {
         buffer[index] = c;
         index++;
      }
   }
   return new string(buffer, 0, index);
}

Solution 2:

Well, unless you really need to squeeze the performance out of your function, just go with what is easiest to maintain and understand. A regular expression would look like this:

For additional performance, you can either pre-compile it or just tell it to compile on first call (subsequent calls will be faster.)

public static string RemoveSpecialCharacters(string str)
{
    return Regex.Replace(str, "[^a-zA-Z0-9_.]+", "", RegexOptions.Compiled);
}

Solution 3:

I suggest creating a simple lookup table, which you can initialize in the static constructor to set any combination of characters to valid. This lets you do a quick, single check.

edit

Also, for speed, you'll want to initialize the capacity of your StringBuilder to the length of your input string. This will avoid reallocations. These two methods together will give you both speed and flexibility.

another edit

I think the compiler might optimize it out, but as a matter of style as well as efficiency, I recommend foreach instead of for.