Most reliable split character

We currently use

public const char Separator = ((char)007);

I think this is the bell character (the beep sound), if I am not mistaken.

Aside from 0x0, which may not be available (because of null-terminated strings, for example), the ASCII control characters between 0x1 and 0x1f are good candidates. The ASCII characters 0x1c-0x1f are even designed for this purpose and are named File Separator, Group Separator, Record Separator and Unit Separator. However, they are forbidden in some transport formats, such as XML 1.0.

In that case, characters from the Unicode private use code points (U+E000-U+F8FF) may be used.
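
As a rough sketch of the control-character approach, assuming the data never legitimately contains U+001F (the names SeparatorDemo, JoinFields and SplitFields are made up for illustration):

```csharp
using System;

public static class SeparatorDemo
{
    // U+001F is the ASCII Unit Separator. The constant name is hypothetical.
    public const char UnitSeparator = '\u001F';

    // Joins the fields, refusing any field that already contains the separator,
    // because such a field could not be recovered by a plain Split.
    public static string JoinFields(params string[] fields)
    {
        foreach (string field in fields)
        {
            if (field.IndexOf(UnitSeparator) >= 0)
                throw new ArgumentException("Field contains the separator character.");
        }
        return string.Join(UnitSeparator.ToString(), fields);
    }

    // Splits a joined record back into its fields.
    public static string[] SplitFields(string record)
    {
        return record.Split(UnitSeparator);
    }
}
```

If the result has to travel through XML, a private use code point such as '\uE000' could be substituted for the constant, under the same assumption that the data never contains it.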

One last option would be to use an escaping strategy, so that the separator character can still appear in the data. However, this complicates the task quite a lot, and with a conventional escape prefix you can no longer rely on a plain String.Split.
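
To illustrate why String.Split stops being enough, here is a minimal sketch of one conventional scheme: a backslash in front of every delimiter or backslash in the data. The delimiter '|' and the names BackslashEscaping, Join and Split are assumptions chosen for the example, not something the question prescribes.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class BackslashEscaping
{
    private const char Delimiter = '|';    // assumed delimiter
    private const char EscapeChar = '\\';  // assumed escape prefix

    // Writes the fields separated by Delimiter, prefixing any delimiter
    // or escape character inside a field with EscapeChar.
    public static string Join(IEnumerable<string> fields)
    {
        var sb = new StringBuilder();
        bool first = true;
        foreach (string field in fields)
        {
            if (!first) sb.Append(Delimiter);
            first = false;
            foreach (char c in field)
            {
                if (c == Delimiter || c == EscapeChar) sb.Append(EscapeChar);
                sb.Append(c);
            }
        }
        return sb.ToString();
    }

    // Scans character by character, because a delimiter only ends a field
    // when it is not preceded by an unconsumed escape character.
    // A lone trailing escape character is silently dropped in this sketch.
    public static List<string> Split(string record)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool escaped = false;
        foreach (char c in record)
        {
            if (escaped) { current.Append(c); escaped = false; }
            else if (c == EscapeChar) { escaped = true; }
            else if (c == Delimiter) { fields.Add(current.ToString()); current.Clear(); }
            else { current.Append(c); }
        }
        fields.Add(current.ToString());
        return fields;
    }
}
```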

You can safely use whatever character you like as a delimiter, if you escape the string so that you know that it doesn't contain that character.

For example, let's choose the character 'a' as the delimiter. (I intentionally picked a common character to show that any character can be used.)

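Use the character 'b' as the escape code. We replace any occurrence of 'a' with 'b1' and any occurrence of 'b' with 'b2'. The 'b' replacement has to run first; otherwise the 'b' introduced by each 'b1' would itself get escaped: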

private static string Escape(string s) {
   return s.Replace("b", "b2").Replace("a", "b1");
}

Now, the string doesn't contain any 'a' characters, so you can put several of those strings together:

string msg = Escape("banana") + "a" + Escape("aardvark") + "a" + Escape("bark");

The string now looks like this:

b2b1nb1nb1ab1b1rdvb1rkab2b1rk

Now you can split the string on 'a' and get the individual parts:

b2b1nb1nb1
b1b1rdvb1rk
b2b1rk

To decode the parts you do the replacement backwards:

private static string Unescape(string s) {
   return s.Replace("b1", "a").Replace("b2", "b");
}

So splitting the string and decoding the parts is done like this:

string[] parts = msg.Split('a');
for (int i = 0; i < parts.Length; i++) {
  parts[i] = Unescape(parts[i]);
}

Or using LINQ:

string[] parts = msg.Split('a').Select<string,string>(Unescape).ToArray();

If you choose a less common character as the delimiter, there are of course fewer occurrences that need to be escaped. The point is that the method makes sure that the character is safe to use as a delimiter without making any assumptions about what characters exist in the data that you want to put in the string.
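
Putting the pieces of this answer together, a self-contained sketch might look like the following. Escape and Unescape are taken from the answer as written; the wrapper class EscapedList and the helpers Pack and Unpack are hypothetical names added for the example.

```csharp
using System;
using System.Linq;

public static class EscapedList
{
    // 'b' is escaped first so that the 'b' in "b1" is not escaped again.
    private static string Escape(string s)
    {
        return s.Replace("b", "b2").Replace("a", "b1");
    }

    // "b1" is decoded first; decoding "b2" first could create false "b1" pairs.
    private static string Unescape(string s)
    {
        return s.Replace("b1", "a").Replace("b2", "b");
    }

    // Escapes every part and joins them with the delimiter 'a'.
    public static string Pack(params string[] parts)
    {
        return string.Join("a", parts.Select(p => Escape(p)));
    }

    // Splits on 'a' (which can no longer occur inside a part) and decodes.
    public static string[] Unpack(string msg)
    {
        return msg.Split('a').Select(p => Unescape(p)).ToArray();
    }
}
```

With these helpers, Pack("banana", "aardvark", "bark") produces the string shown earlier, b2b1nb1nb1ab1b1rdvb1rkab2b1rk, and Unpack turns it back into the three original words.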

I usually prefer the '|' symbol as the split character. If you are not sure what the user will enter in the text, you can prevent the user from entering certain special characters and then choose the split character from among them.

It depends on what you're splitting.

In most cases it's best to use split chars that are fairly commonly used, for instance

value, value, value

value|value|value

key=value;key=value;

key:value;key:value;

You can use quoted identifiers nicely with commas:

"value", "value", "value with , inside", "value"
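
A quoted list like this cannot be handed to String.Split, since the comma inside the quotes would be treated as a separator. Below is a minimal sketch of a parser for it. The names QuotedList and Parse are hypothetical, and two conventions are assumed rather than taken from the answer: a doubled quote inside a value stands for one literal quote (as in CSV), and whitespace outside quotes is ignored.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class QuotedList
{
    // Splits on commas that are outside double quotes.
    // Assumptions: "" inside a quoted value means one literal quote,
    // and whitespace outside quotes is dropped.
    public static List<string> Parse(string line)
    {
        var values = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;
        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c == '"' && i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"'); // doubled quote: keep one, skip the other
                    i++;
                }
                else if (c == '"') inQuotes = false;
                else current.Append(c);   // commas in here are ordinary data
            }
            else if (c == '"') inQuotes = true;
            else if (c == ',')
            {
                values.Add(current.ToString());
                current.Clear();
            }
            else if (!char.IsWhiteSpace(c)) current.Append(c);
        }
        values.Add(current.ToString());
        return values;
    }
}
```

For anything beyond a sketch like this, an established CSV parser is probably the safer choice.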

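I tend to use , first, then |, and if I can't use either of them I use the section sign §.

Note that on Windows you can type characters by code with ALT+number (on the numeric keypad only), so § is ALT+21. Strictly speaking, § is not part of 7-bit ASCII; the ALT code comes from the OEM code page.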
