C# regex to remove non-printable characters and control characters in text that mixes many different languages and Unicode letters
You may remove all control and other non-printable characters with
s = Regex.Replace(s, @"\p{C}+", string.Empty);
The \p{C} Unicode category class matches all control characters, even those outside the ASCII table, because in .NET Unicode category classes are Unicode-aware by default.
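As a quick illustration of the point above, here is a minimal sketch that applies \p{C}+ to a mixed-script string. The class name, the helper name StripOther and the sample text are made up for this example; only the pattern comes from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RemoveOtherDemo
{
    // Removes every UTF-16 code unit that belongs to the Unicode "Other" (C) category.
    public static string StripOther(string s) =>
        Regex.Replace(s, @"\p{C}+", string.Empty);

    public static void Main()
    {
        // Hypothetical sample: Latin, Cyrillic and Greek words polluted with
        // BEL (U+0007), NEXT LINE (U+0085) and ZERO WIDTH SPACE (U+200B).
        string sample = "Tä\u0007kör \u0085Привет\u200B γειά";

        // The letters of all three scripts are kept; only the three
        // invisible characters are dropped.
        Console.WriteLine(StripOther(sample)); // Täkör Привет γειά
    }
}
```

Note that U+0085 and U+200B lie outside the ASCII range, so an ASCII-only class such as [\x00-\x1F] would not catch them.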
Breaking it down into subcategories:
- To match only the basic control characters, use \p{Cc}+ (see the 65 characters in the Other, Control Unicode category). It is equivalent to the regex [\u0000-\u0008\u000E-\u001F\u007F-\u0084\u0086-\u009F\u0009-\u000D\u0085]+
- To match only the 161 Other, Format characters, including the well-known soft hyphen (\u00AD), zero-width space (\u200B), zero-width non-joiner (\u200C), zero-width joiner (\u200D), left-to-right mark (\u200E) and right-to-left mark (\u200F), use \p{Cf}+. The equivalent that also covers astral plane code points is the regex (?:[\xAD\u0600-\u0605\u061C\u06DD\u070F\u08E2\u180E\u200B-\u200F\u202A-\u202E\u2060-\u2064\u2066-\u206F\uFEFF\uFFF9-\uFFFB]|\uD804[\uDCBD\uDCCD]|\uD80D[\uDC30-\uDC38]|\uD82F[\uDCA0-\uDCA3]|\uD834[\uDD73-\uDD7A]|\uDB40[\uDC01\uDC20-\uDC7F])+
- To match the 137,468 Other, Private Use code points, use \p{Co}+, or its equivalent that also covers astral plane code points, (?:[\uE000-\uF8FF]|[\uDB80-\uDBBE\uDBC0-\uDBFE][\uDC00-\uDFFF]|[\uDBBF\uDBFF][\uDC00-\uDFFD])+
- To match the 2,048 Other, Surrogate code points, which UTF-16 uses in pairs to encode astral characters such as many emojis, use \p{Cs}+, or the regex [\uD800-\uDFFF]+
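The subcategories listed above can also be combined in one character class. The following is a minimal sketch, not a definitive recipe: it assumes you want to drop control (Cc) and format (Cf) characters while leaving surrogates alone so that emojis survive. The class name InvisibleCharCleaner and the method name Clean are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InvisibleCharCleaner
{
    // Cc = control characters, Cf = format characters (soft hyphen, zero-width space, ...).
    // Cs (surrogates) is deliberately left out so emoji and other astral characters survive.
    private static readonly Regex ControlOrFormat =
        new Regex(@"[\p{Cc}\p{Cf}]+", RegexOptions.Compiled);

    public static string Clean(string s) => ControlOrFormat.Replace(s, string.Empty);

    public static void Main()
    {
        // Hypothetical sample containing a soft hyphen, a zero-width space,
        // a NUL character and an emoji (U+1F60E).
        string sample = "soft\u00ADhyphen\u200B and\u0000 emoji \U0001F60E";
        Console.WriteLine(Clean(sample)); // softhyphen and emoji 😎
    }
}
```

One design caveat: the zero-width joiner and non-joiner are in Cf, and some scripts (for example Persian) and emoji sequences rely on them, so removing all of Cf can change how such text renders. If that matters for your data, consider excluding \u200C and \u200D from the class.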
You can try:
string s = "Täkörgåsmrgås";
// Keeps ASCII only, so letters such as ä, ö and å are removed as well.
s = Regex.Replace(s, @"[^\u0000-\u007F]+", string.Empty);
Updated answer after comments:
Documentation about non-printable characters: https://en.wikipedia.org/wiki/Control_character
Char.IsControl Method:
https://msdn.microsoft.com/en-us/library/system.char.iscontrol.aspx
Maybe you can try:
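using System.Linq; // provides Where and ToArray
string input = "abc\u0007def"; // example input, replace with your own string
// char.IsControl is true only for characters in the Cc (control) category.
string output = new string(input.Where(c => !char.IsControl(c)).ToArray());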
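If you prefer to stay away from regex but need more than char.IsControl covers, a category-based filter is one option. The sketch below is an assumption-laden illustration, not part of the original answer: it uses CharUnicodeInfo.GetUnicodeCategory(string, int), which reports the category of a whole surrogate pair, to drop control and format code points. The names CategoryFilter and RemoveControlAndFormat are hypothetical.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class CategoryFilter
{
    // Drops control (Cc) and format (Cf) code points, walking the string one
    // code point at a time so that surrogate pairs are classified as a whole.
    public static string RemoveControlAndFormat(string input)
    {
        var sb = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            // For a valid surrogate pair this overload returns the category
            // of the pair, not UnicodeCategory.Surrogate.
            UnicodeCategory cat = CharUnicodeInfo.GetUnicodeCategory(input, i);
            bool isPair = char.IsSurrogatePair(input, i);

            if (cat != UnicodeCategory.Control && cat != UnicodeCategory.Format)
                sb.Append(input, i, isPair ? 2 : 1);

            if (isPair) i++; // skip the low surrogate of the pair
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // Hypothetical sample: BEL, zero-width space, the astral format
        // character U+E0001 (LANGUAGE TAG) and an emoji.
        string sample = "a\u0007b\u200Bc\U000E0001 \U0001F60E";
        Console.WriteLine(RemoveControlAndFormat(sample)); // abc 😎
    }
}
```

Unlike a plain \p{Cf} pattern, this approach also catches format characters outside the Basic Multilingual Plane, and it keeps emojis because their pairs are not classified as surrogates.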
To remove all control and other non-printable characters:
s = Regex.Replace(s, @"\p{C}+", String.Empty);
To remove the control characters only (if you don't want to remove the emojis 😎):
s = Regex.Replace(s, @"\p{Cc}+", String.Empty);
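To see the difference between the two patterns, here is a small side-by-side sketch. The class and method names are hypothetical; the behaviour shown follows from .NET regex working on UTF-16 code units, where each half of an emoji is a surrogate (Cs) and therefore part of \p{C}.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmojiComparison
{
    // Removes everything in the "Other" category, including surrogate halves.
    public static string RemoveAllOther(string s) =>
        Regex.Replace(s, @"\p{C}+", string.Empty);

    // Removes control characters (Cc) only.
    public static string RemoveControlOnly(string s) =>
        Regex.Replace(s, @"\p{Cc}+", string.Empty);

    public static void Main()
    {
        // Hypothetical sample: a BEL character followed by a space and an emoji.
        string sample = "ok\u0007 \U0001F60E";

        Console.WriteLine("[" + RemoveAllOther(sample) + "]");    // [ok ]
        Console.WriteLine("[" + RemoveControlOnly(sample) + "]"); // [ok 😎]
    }
}
```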