How do I detect non-printable characters in .NET?
I'm just wondering if there is a method in .NET 2.0 that checks whether a character is printable or not – something like isprint(int) from standard C.
I found Char.IsControl(Char). Could that be used for this purpose?
Solution 1:
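You might want to use Char.IsControl(Char). That is what I'm using. You definitely do not want to use a bare <0x20 comparison: it only covers the ASCII control characters below the space character, it misses DEL (U+007F) and the control characters in U+0080-U+009F, and it tells you nothing about characters above 127, which is where every non-Latin and most non-English characters live.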
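To make the difference concrete, here is a minimal sketch that compares the two tests over the single-byte range. The class and method names are hypothetical, chosen only for illustration; the only framework call used is Char.IsControl(Char).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical demo class, not part of the framework.
static class ControlCheckDemo
{
    // The naive test: only catches the control characters U+0000..U+001F.
    public static bool IsControlNaive(char c)
    {
        return c < 0x20;
    }

    // Returns every character in U+0000..U+00FF on which the naive test
    // and Char.IsControl() disagree.
    public static List<char> FindDisagreements()
    {
        var result = new List<char>();
        for (int code = 0; code <= 0xFF; code++)
        {
            char c = (char)code;
            if (IsControlNaive(c) != Char.IsControl(c))
            {
                result.Add(c);
            }
        }
        return result;
    }
}
```

The disagreements should be exactly DEL and the control characters in U+0080-U+009F, which the <0x20 comparison lets through.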
Solution 2:
If by printable you mean renders something - even if that something is blank space (whitespace), [negating] Char.IsControl() alone is not enough to determine if a character is printable.
It isn't enough even in the single-byte U+0000 - U+00FF Unicode range (which is compatible with ASCII / ISO-8859-1), because the ASCII whitespace characters other than the space character are also classified as control characters, so that Char.IsControl('\t') and Char.IsControl('\n') report true as well.
Beyond the single-byte range, there are other categories of non-rendering characters that must be recognized.
A solution for the single-byte U+0000 - U+00FF Unicode range (which is compatible with ASCII / ISO-8859-1):
// Sample input char.
char c = (char)0x20; // space
var isPrintable = ! Char.IsControl(c) || Char.IsWhiteSpace(c);
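As a sketch of how that rule could be applied to a whole string, the helper below reports the positions of the characters that fail it. The class and method names are hypothetical, and the result is only meaningful for input that stays within U+0000 - U+00FF.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper class, not part of the framework.
static class Latin1PrintableScan
{
    // The single-byte rule from above: control characters are
    // non-printable unless they are also whitespace (tab, newline, ...).
    public static bool IsPrintable(char c)
    {
        return !Char.IsControl(c) || Char.IsWhiteSpace(c);
    }

    // Returns the indexes of all characters in text that fail the rule.
    public static List<int> FindNonPrintable(string text)
    {
        var result = new List<int>();
        for (int i = 0; i < text.Length; i++)
        {
            if (!IsPrintable(text[i]))
            {
                result.Add(i);
            }
        }
        return result;
    }
}
```

For example, Latin1PrintableScan.FindNonPrintable("ab\u0001c\td") should report only index 2, because the tab at index 4 counts as whitespace.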
An approximation of a solution for all Unicode characters:
Sadly, there is no simple solution that is complete:
- A fundamental limitation of a Char-based test is that type Char can only represent characters up to code point U+FFFF, i.e., only characters in the so-called BMP (Basic Multilingual Plane). Characters outside the BMP - with higher code points - must be represented as two Char instances (so-called surrogate pairs).
- The UnicodeCategory.PrivateUse category of characters, as the name suggests, is not standardized; for instance, U+F8FF on macOS contains the Apple symbol, whereas it is undefined on Windows. So it may contain printable characters, and you'd have to determine dynamically whether they are printable.
- The UnicodeCategory.Format category mostly contains non-rendering characters, but there are exceptions - see this table. You could hard-code these exceptions for a given version of the Unicode standard, but that is cumbersome and may become obsolete over time.
Thus, the following code assumes that all characters in UnicodeCategory.PrivateUse and UnicodeCategory.Format are printable, which means that at least some characters will be misclassified.
using System;
using System.Linq;
using System.Globalization;
// ...
// Sample input char.
char c = (char)0x20; // space
// The set of Unicode character categories containing non-rendering,
// unknown, or incomplete characters.
// !! UnicodeCategory.Format and UnicodeCategory.PrivateUse can NOT be
// !! included in this set, because they may (private-use) or do (format)
// !! contain at least *some* rendering characters.
var nonRenderingCategories = new UnicodeCategory[] {
    UnicodeCategory.Control,
    UnicodeCategory.OtherNotAssigned,
    UnicodeCategory.Surrogate };
// Char.IsWhiteSpace() includes the ASCII whitespace characters that
// are categorized as control characters. Any other character is
// printable, unless it falls into the non-rendering categories.
// (The Contains() call on the array is the LINQ extension method.)
var isPrintable = Char.IsWhiteSpace(c) ||
    !nonRenderingCategories.Contains(Char.GetUnicodeCategory(c));
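One way to work around the surrogate-pair limitation is to test positions in a string rather than isolated Char values. The sketch below assumes the same three non-rendering categories as the code above and relies on CharUnicodeInfo.GetUnicodeCategory(String, Int32), which reports the category of the combined code point when the index points at a valid surrogate pair. The class and method names are hypothetical, and the result depends on the Unicode data shipped with the runtime, so recently added characters may still come back as unassigned on older frameworks.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper class, not part of the framework.
static class UnicodePrintableScan
{
    // Returns the UTF-16 indexes at which a non-printable character
    // (under the approximation described above) starts.
    public static List<int> FindNonPrintable(string text)
    {
        var result = new List<int>();
        for (int i = 0; i < text.Length; i++)
        {
            // For a valid surrogate pair this overload reports the category
            // of the whole code point; a lone surrogate stays Surrogate.
            UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(text, i);

            bool nonRendering =
                category == UnicodeCategory.Control ||
                category == UnicodeCategory.OtherNotAssigned ||
                category == UnicodeCategory.Surrogate;

            // Whitespace control characters (tab, newline, ...) still count
            // as printable, as in the Char-based version.
            if (nonRendering && !Char.IsWhiteSpace(text, i))
            {
                result.Add(i);
            }

            // Skip the low surrogate so each pair is examined only once.
            if (Char.IsSurrogatePair(text, i))
            {
                i++;
            }
        }
        return result;
    }
}
```

With this approach a well-formed pair is judged by its actual category, while an unpaired surrogate is still reported as non-printable. The PrivateUse and Format caveats apply unchanged.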
Solution 3:
In addition to Char.IsControl() there are several other Char functions that can be used to determine what category a given char value belongs to:
IsLetter()
IsNumber()
IsDigit()
IsLetterOrDigit()
IsSymbol()
IsPunctuation()
IsSeparator()
IsWhiteSpace()
If what you have is a "traditional ASCII text" file, and you want to use supplied functions, the expression:
(Char.IsLetterOrDigit(ch) || Char.IsPunctuation(ch) || Char.IsSymbol(ch) || (ch==' '))
should work.
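As a quick sanity check of that expression, the sketch below wraps it in a method and counts how many of the 128 ASCII characters pass. The class and method names are hypothetical; the expression itself is the one given above.

```csharp
using System;

// Hypothetical helper class, not part of the framework.
static class AsciiPrintable
{
    // The expression from above, intended for "traditional ASCII text" only.
    public static bool IsAsciiPrintable(char ch)
    {
        return Char.IsLetterOrDigit(ch) || Char.IsPunctuation(ch)
            || Char.IsSymbol(ch) || ch == ' ';
    }

    // Counts the characters in U+0000..U+007F that pass the test.
    public static int CountPrintableAscii()
    {
        int count = 0;
        for (int code = 0; code <= 0x7F; code++)
        {
            if (IsAsciiPrintable((char)code))
            {
                count++;
            }
        }
        return count;
    }
}
```

Within the ASCII range this should accept the space character and everything up to '~', and reject the control characters and DEL, which is what isprint() does in the default C locale. Note that tab and newline are rejected here, unlike in the whitespace-aware tests in Solution 2.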
Now, if you are working with Unicode, you are opening a can of worms. Even back in the day, whether a space is printable or not printable was open to interpretation (hence the isprint() and isgraph() functions). See this related question and answers about "printable" Unicode characters.