How can I perform a Unicode aware character by character comparison? [closed]

Solution 1:

Encoding

Unicode defines a list of characters (letters, numbers, analphabetic symbols, control codes and others), but their representation in bytes is defined by an encoding. The most common Unicode encodings nowadays are UTF-8, UTF-16 and UTF-32. UTF-16 is what is usually associated with Unicode because it was chosen for Unicode support in Windows, Java, the .NET environment, and the C and C++ languages (on Windows). Be aware that it's not the only one: during your life you will certainly also meet UTF-8 text (especially from the web and on Linux file systems) and UTF-32 (outside the Windows world). Two very introductory must-read articles: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) and UTF-8 Everywhere - Manifesto. IMO the second one especially (regardless of your opinion on UTF-8 vs UTF-16) is pretty enlightening.

Let me quote Wikipedia:

Because the most commonly used characters are all in the Basic Multilingual Plane, handling of surrogate pairs is often not thoroughly tested. This leads to persistent bugs and potential security holes, even in popular and well-reviewed application software (e.g. CVE-2008-2938, CVE-2012-2135)

To see where the issue is, just start with some simple math: Unicode defines around 110K code points (note that not all of them are graphemes). The "Unicode character type" in C, C++, C#, VB.NET, Java and many other languages on Windows (with the notable exception of VBScript on old classic ASP pages) is UTF-16 encoded, so it's two bytes wide (the type name is intuitive but completely misleading, because it's a code unit, not a character nor a code point). Two bytes can hold only 65,536 distinct values, so some code points necessarily need more than one code unit.

Please check this distinction because it's fundamental: a code unit is logically different from a Character and, even if they sometimes coincide, they're not the same thing. How does this affect your programming life? Imagine you have this C# code and your specification (written by someone who thinks about the true definition of Character) says "password length must be 4 characters":
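To make the code unit / code point distinction concrete, here is a minimal sketch. It is written in Python rather than in the languages mentioned above, only because Python's `str` is indexed by code point, which keeps the arithmetic visible; the helper name `code_units` is made up for this illustration.

```python
def code_units(text: str, encoding: str) -> int:
    """Number of code units `text` needs in the given UTF encoding."""
    unit_size = {"utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}[encoding]
    return len(text.encode(encoding)) // unit_size

han = "\U00020011"  # 𠀑, a character outside the Basic Multilingual Plane

print(len(han))                      # code points: 1
print(code_units(han, "utf-8"))      # 4 one-byte code units
print(code_units(han, "utf-16-le"))  # 2 code units (a surrogate pair)
print(code_units(han, "utf-32-le"))  # 1 code unit
```

The same single character therefore has a "length" of 1, 2 or 4 depending on which unit you count.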

bool IsValidPassword(string text) {
    return text.Length >= 4;
}

That code is ugly, wrong and broken. The Length property returns the number of code units in the text string variable, and you now know they're different from Characters. Your code will accept n̊o̅ as a valid password (but it's made of two characters and four code points, which almost always coincide with code units). Now try to imagine this applied to all layers of your application: a UTF-8 encoded database field naïvely validated with the previous code (where the input is UTF-16); errors will add up and your Polish friend Świętosław Koźmicki won't be happy about this. Now imagine you have to validate a user's first name with the same technique and your users are Chinese (but don't worry, if you don't care then they will be your users for a very short time). Another example: this naïve C# algorithm to count distinct Characters in a string will fail for the same reason:

myString.Distinct().Count()

If the user enters this Han character 𠀑 then your code will wrongly return...2 because its UTF-16 representation is 0xD840 0xDC11 (BTW each of them, alone, is not a valid Unicode character because they're a high and a low surrogate, respectively). The reasons are explained in greater detail in this post; a working solution is also provided there, so I just repeat the essential code here:

StringInfo.GetTextElementEnumerator(text)
    .AsEnumerable<string>()
    .Distinct()
    .Count();

This is roughly equivalent to codePointCount() in Java to count code points in a string. We need AsEnumerable<T>() because GetTextElementEnumerator() returns an IEnumerator instead of an IEnumerable; a simple implementation is described in Split a string into chunks of the same length (remember to check Unicode Text Segmentation for all the rules, for example if you're trying to implement an ellipsis algorithm for text trimming).
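For readers who want to see the three counts side by side, here is a rough sketch in Python (not the C# API discussed above). The function `rough_text_elements` is a hypothetical helper that only attaches combining marks to the preceding base character; it is an approximation, not an implementation of the Unicode Text Segmentation rules.

```python
import unicodedata

def rough_text_elements(text: str) -> list[str]:
    """Group each base character with the combining marks that follow it.

    A deliberate simplification: real segmentation follows the Unicode
    Text Segmentation rules, which also cover Hangul jamo, emoji
    sequences and more.
    """
    elements: list[str] = []
    for ch in text:
        if elements and unicodedata.combining(ch) != 0:
            elements[-1] += ch  # attach the mark to the previous base character
        else:
            elements.append(ch)
    return elements

password = "n\u030ao\u0305"  # n̊o̅: two user-perceived characters

print(len(password.encode("utf-16-le")) // 2)  # UTF-16 code units: 4
print(len(password))                           # code points: 4
print(len(rough_text_elements(password)))      # text elements: 2
```

A "4 characters" password rule checked against the first two numbers accepts this input; checked against the third it does not.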

Is this something only related to string length? Of course not, if you handle keyboard input Char by Char you may need to fix your code. See for example this question about Korean characters handled in KeyUp event.

Unrelated but IMO helpful to understand: this C code (taken from this post) works on char (ASCII/ANSI or UTF-8) but it will fail if converted directly to use wchar_t:

wchar_t* pValue = wcsrchr(wcschr(pExpression, L'|'), L':') + 1;

Note that C++11 introduced clearer character types for encoded text: char16_t and char32_t (plus char8_t from C++20) for UTF-16, UTF-32 and UTF-8 code units respectively, together with std::u16string, std::u32string and std::u8string. length() (and its size() alias) will still return the count of code units, but you can perform encoding conversions with the std::codecvt facets, and IMO these types make your code clearer and more explicit (it isn't astonishing that size() of a u16string returns the number of char16_t elements). For more details about counting Characters in C++ check this nice post. In C things are much easier with char and UTF-8 encoding: this post IMO is a must-read.

Culture Differences

Not all languages are similar; they don't even share some basic concepts. For example our current definition of grapheme can be pretty far from our concept of Character. Let me explain with an example: in the Korean Hangul alphabet, letters are combined into a single syllable (and both letters and syllables are Characters, just represented in a different way when alone and when in a word with other letters). The word 국 (Guk) is one syllable composed of three letters ㄱ, ㅜ and ㄱ (the first and last letters are the same but they're pronounced with different sounds at the beginning or end of a word, which is why they're transliterated as g and k).

Syllables let us introduce another concept: precomposed and decomposed sequences. The Hangul syllable han 한 can be represented as a single character (U+D55C) or as a decomposed sequence of the letters ᄒ, ᅡ and ᆫ. If you're, for example, reading a text file you may have both (and users may enter both sequences in your input boxes), but they must compare equal. Note that if you type those letters sequentially they'll always be displayed as a single syllable (copy & paste the single characters, without spaces, and try) but the final form (precomposed or decomposed) depends on your IME.
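The precomposed/decomposed equivalence for the syllable han can be checked with a short sketch. This uses Python's standard `unicodedata` module purely as an illustration; other platforms expose the same normalization forms under different names.

```python
import unicodedata

precomposed = "\ud55c"             # the syllable as a single code point
decomposed = "\u1112\u1161\u11ab"  # the same syllable as three conjoining jamo

print(precomposed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```

A plain comparison sees two different strings; only after normalizing both sides to the same form do they compare equal.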

In Czech "ch" is a digraph and it's treated as a single letter. It has its own collation rule (it sorts between H and I); with Czech sorting fyzika comes before chemie! If you count Characters and you tell your users that the word Chechtal is composed of 8 Characters they'll think your software is bugged and that your support for their language is merely limited to a bunch of translated resources. Let's add exceptions: in puchoblík (and a few other words) C and H are not a digraph and they're separate letters. Note that there are also other cases, like "dž" in Slovak and others, where a sequence is counted as a single character even if it uses two or three UTF-16 code units! The same happens in many other languages too (for example ll in Catalan). True languages have more exceptions and special cases than PHP!
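To illustrate why the digraph matters for both sorting and counting, here is a deliberately simplified Python sketch. Every name in it (`CZECH_LETTERS`, `czech_letters`, `czech_key`) is hypothetical, the alphabet omits accented letters, and the exception words where c and h stay separate are not handled; real code should rely on a proper collation library.

```python
import re

# Simplified alphabet: accented letters and the exception words where
# c and h are separate letters are ignored in this sketch.
CZECH_LETTERS = list("abcdefgh") + ["ch"] + list("ijklmnopqrstuvwxyz")
RANK = {letter: i for i, letter in enumerate(CZECH_LETTERS)}
TOKEN = re.compile(r"ch|.", re.DOTALL)

def czech_letters(word: str) -> list[str]:
    """Split a word into letters, treating the digraph 'ch' as one letter."""
    return TOKEN.findall(word.lower())

def czech_key(word: str) -> list[int]:
    """Sort key that places 'ch' between 'h' and 'i'."""
    return [RANK.get(letter, len(RANK)) for letter in czech_letters(word)]

words = ["chemie", "fyzika"]
print(sorted(words))                   # ['chemie', 'fyzika']
print(sorted(words, key=czech_key))    # ['fyzika', 'chemie']
print(len(czech_letters("Chechtal")))  # 6, not 8
```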

Note that appearance alone is not always enough for equivalence, for example: A (U+0041 LATIN CAPITAL LETTER A) is not equivalent to А (U+0410 CYRILLIC CAPITAL LETTER A). Conversely character ٢ (U+0662 ARABIC-INDIC DIGIT TWO) and ۲ (U+06F2 EXTENDED ARABIC-INDIC DIGIT TWO) are visually and conceptually equivalent but they are different Unicode code points (see also next paragraph about numbers and synonyms).

Symbols like ? and ! are sometimes used as characters (for example in the earliest written Haida language). In some languages (like the earliest written forms of Native American languages) numbers and other symbols have also been borrowed from the Latin alphabet and used as letters (mind this if you have to handle those languages and separate alphanumeric characters from symbols; Unicode can't distinguish this); one example is the ! in !Kung, a Khoisan African language. In Catalan, when ll is not a digraph, a diacritic (or a middot, U+00B7) is used to separate the characters, as in cel·les (in this case the character count is 6 and code units/code points are 7, whereas a hypothetical non-existing word celles would result in 5 characters).

The same word may be written in more than one form. This may be something you have to care about if, for example, you provide a full-text search. For example the Chinese word 家 (house) can be transliterated as Jiā in pinyin, and in Japanese the same word may be written with the same Kanji 家, or as いえ in Hiragana (and others too), or transliterated in romaji as ie. Is this limited to words? No, it applies to characters too, and for numbers it is pretty common: 2 (Arabic number in the Roman alphabet), ٢ (in Arabic and Persian) and 二 (Chinese and Japanese) are exactly the same cardinal number. Let's add some complexity: in Chinese it's also very common to write the same number as 貳 (simplified: 贰). I don't even mention prefixes (micro, nano, kilo and so on). See this post for a real world example of this issue. It's not limited to far-east languages only: the apostrophe (U+0027 APOSTROPHE or, better, U+2019 RIGHT SINGLE QUOTATION MARK) is often used in Czech and Slovak instead of its superimposed counterpart (U+02BC MODIFIER LETTER APOSTROPHE): ď and d' are then equivalent (similar to what I said about the middot in Catalan).

Maybe you should properly handle lower case "ss" in German so that it compares equal to ß (and problems will arise for case insensitive comparison). There is a similar issue in Turkish if you have to provide non-exact string matching for i and its forms (see the section about Case).
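The "same number, different code points" case can be inspected through the numeric value that the Unicode character database attaches to each character. The sketch below uses Python's `unicodedata` module as one way to read that property; it says nothing about transliteration or synonyms, which need dictionaries rather than character data.

```python
import unicodedata

# DIGIT TWO, ARABIC-INDIC DIGIT TWO, EXTENDED ARABIC-INDIC DIGIT TWO
# and the CJK ideograph for two
twos = ["2", "\u0662", "\u06f2", "\u4e8c"]

values = [unicodedata.numeric(ch) for ch in twos]
print(values)          # [2.0, 2.0, 2.0, 2.0]
print(len(set(twos)))  # 4 distinct code points

# Look-alikes are a different matter: Latin A and Cyrillic A stay distinct
latin_a, cyrillic_a = "\u0041", "\u0410"
same = unicodedata.normalize("NFKC", latin_a) == unicodedata.normalize("NFKC", cyrillic_a)
print(same)            # False
```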

If you're working with professional text you may also meet ligatures; even in English, for example, æsthetics is 9 code points but 10 characters! The same applies, for example, to the ethel character œ (U+0153 LATIN SMALL LIGATURE OE, absolutely necessary if you're working with French text); hors d'oeuvre is equivalent to hors d'œuvre (but also ethel and œthel). Both are (together with German ß) lexical ligatures, but you may also meet typographical ligatures (such as U+FB00 LATIN SMALL LIGATURE FF) and they have their own part of the Unicode character set (presentation forms). Nowadays diacritics are much more common even in English (see tchrist's post about people freed of the tyranny of the typewriter, and please read Bringhurst's citation carefully). Do you think you (and your users) won't ever type façade, naïve and prêt-à-porter or "classy" noöne or coöperation?

Here I don't even mention word counting because it'll open even more problems: in Korean each word is composed of syllables but in, for example, Chinese and Japanese, Characters are counted as words (unless you want to implement word counting using a dictionary). Now let's take this Chinese sentence: 这是一个示例文本, roughly equivalent to the Japanese sentence これは、サンプルのテキストです. How do you count them? Moreover, if they're transliterated to Shì yīgè shìlì wénběn and Kore wa, sanpuru no tekisutodesu, should they be matched in a text search?
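The difference between typographical and lexical ligatures shows up when compatibility normalization is applied. A small Python sketch, offered only as an illustration of that behaviour:

```python
import unicodedata

typographic = "\ufb00"  # LATIN SMALL LIGATURE FF (a presentation form)
lexical = "\u0153"      # LATIN SMALL LIGATURE OE

print(unicodedata.normalize("NFKC", typographic))         # ff
print(unicodedata.normalize("NFKC", lexical) == lexical)  # True: left unchanged
print(len("\u00e6sthetics"))                              # 9 code points
```

So normalization folds the presentation form into two letters, while matching œ against oe needs an explicit, language-aware rule of your own.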

Speaking about Japanese: full width Latin Characters are different from half width Characters, and if your input is Japanese romaji text you have to handle this, otherwise your users will be astonished when Ｔ won't compare equal to T (in this case what should be just glyphs became code points). Remember this if you're giving, for example, markdown files to translate, because [name](link) parsing might be broken because of this.
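Both the full width letter and the markdown link problem can be reproduced in a few lines. In this Python sketch the `LINK` pattern is a simplistic, made-up matcher (not a real markdown parser), and compatibility normalization (NFKC) is shown as one possible way to fold full width forms before parsing.

```python
import re
import unicodedata

LINK = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")  # simplistic [name](link) matcher

fullwidth_t = "\uff34"  # FULLWIDTH LATIN CAPITAL LETTER T
print(fullwidth_t == "T")                                 # False
print(unicodedata.normalize("NFKC", fullwidth_t) == "T")  # True

# The same link typed with full width brackets and parentheses
translated = "\uff3bname\uff3d\uff08link\uff09"
folded = unicodedata.normalize("NFKC", translated)

print(LINK.search(translated))       # None
print(LINK.search(folded).groups())  # ('name', 'link')
```

Note that NFKC also changes other characters (ligatures, superscripts and so on), so whether it is acceptable depends on what the text is used for.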

OK, is this enough to highlight problem surface?

Duplicated Characters

Unicode (primarily for ASCII compatibility and other historical reasons) has duplicated characters; before you do a comparison you have to perform normalization, otherwise à (single code point) won't be equal to à (a plus U+0300 COMBINING GRAVE ACCENT). Is this an uncommon corner case? Not really, also take a look at this real world example from Jon Skeet. Also (see section Culture Differences) precomposed and decomposed sequences introduce duplicates.

Note that diacritics are not the only source of confusion. When users type on their keyboard they'll probably enter ' (U+0027 APOSTROPHE) but it's supposed to also match ’ (U+2019 RIGHT SINGLE QUOTATION MARK), normally used in typography (the same is true for many, many Unicode symbols that are almost equivalent from the user's point of view but distinct in typography; imagine writing a text search inside digital books).

In short two strings must be considered equal (this is a very important concept!) if they are canonically equivalent and they are canonically equivalent if they have the same linguistic meaning and appearance, even if they are composed from different Unicode code points.
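A comparison based on canonical equivalence can be sketched as follows. The helper name `canonically_equal` is invented for this example and Python's `unicodedata` is used only for illustration; .NET, Java and ICU offer equivalent normalization calls.

```python
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after bringing both to the same normalization form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(canonically_equal("\u00e0", "a\u0300"))  # precomposed vs a + combining grave: True
print(canonically_equal("\u212b", "\u00c5"))   # ANGSTROM SIGN vs A WITH RING ABOVE: True
print(canonically_equal("'", "\u2019"))        # APOSTROPHE vs RIGHT SINGLE QUOTATION MARK: False
```

The last line is the important caveat: the two apostrophes are not canonically equivalent, so treating them as a match is an application-level decision that normalization will not make for you.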

Case

If you have to perform case insensitive comparison then you'll have even more problems. I assume you do not perform hobbyist case insensitive comparison using toupper() or equivalent unless, to pick one example for all, you want to explain to your users why 'i'.ToUpper() != 'I' in Turkish (the upper case of i is İ, not I; BTW the lower case of I is ı).

Another problem is the eszett ß in German (a ligature for long s + short s, used in ancient times also in English, elevated to the dignity of a character). It has an upper case version but (at this moment) .NET Framework wrongly returns "ẞ" != "ß".ToUpper() (yet its use is mandatory in some scenarios, see also this post). Unfortunately ss does not always become ẞ (upper case), ss is not always equal to ß (lower case), and sometimes sz is also used in upper case. Confusing, right?
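These case mappings are easy to observe. The Python sketch below uses the language's built-in, culture-independent case methods, which is exactly why the Turkish rule has to be added by hand here; `turkish_upper` is a hypothetical helper for illustration, not a replacement for a culture-aware API such as the .NET overloads that take a culture.

```python
def turkish_upper(text: str) -> str:
    """Upper-case with the Turkish dotted i rule (illustrative only)."""
    # i -> LATIN CAPITAL LETTER I WITH DOT ABOVE; dotless i already maps to I
    return text.replace("i", "\u0130").upper()

print("i".upper())                             # I  (not what a Turkish reader expects)
print(turkish_upper("istanbul"))               # İSTANBUL
print("\u00df".upper())                        # SS
print("\u1e9e".lower() == "\u00df")            # True: capital sharp s lower-cases to ß
print("\u00df".casefold() == "ss".casefold())  # True
```

Case folding (`casefold()`) is meant for caseless matching and treats ß like ss, which may or may not be what your German users expect.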

Even More

Globalization is not only about text: what about dates and calendars, number formatting and parsing, colors and layout? A book won't be enough to describe all the things you should care about, but what I would highlight here is that a few localized strings won't make your application ready for an international market.

Even just about text more questions arise: how this applies to regex? How spaces should be handled? Is an em space equal to an en space? In a professional application how "U.S.A." should be compared with "USA" (in a free-text search)? On the same line of thinking: how to manage diacritics in comparison?

How to handle text storage? Forget that you can safely detect encoding: to open a file you need to know its encoding (unless, of course, you're planning to do what HTML parsers do with <meta charset="UTF-8">, or XML/XHTML with encoding="UTF-8" in <?xml>).
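A short sketch of what goes wrong when the encoding is guessed. The byte strings are built in memory so the example is self-contained, and the choice of Windows-1252 and Latin-1 as the "wrong guesses" is just an assumption for the demonstration (Python is again used only for illustration).

```python
data = "fa\u00e7ade".encode("utf-8")  # the word façade stored as UTF-8

print(data.decode("utf-8"))   # façade
print(data.decode("cp1252"))  # faÃ§ade: wrong guess, and no error is raised

legacy = "fa\u00e7ade".encode("latin-1")  # the same word from a legacy system
try:
    legacy.decode("utf-8")
    outcome = "decoded"
except UnicodeDecodeError:
    outcome = "invalid UTF-8"
print(outcome)                # invalid UTF-8
```

The second case at least fails loudly; the first silently produces garbage, which is why the encoding has to be known rather than detected.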

Historical "Introduction"

What we see as text on our monitors is just a chunk of bytes in computer memory. By convention each value (or group of values, in the same way as an int32_t represents a number) represents a character. How that character is then drawn on screen is delegated to something else (to simplify a little bit, think about a font).

If we arbitrarily decide that each character is represented with one byte then we have 256 symbols available (just as when we use int8_t, System.SByte or java.lang.Byte for a number we have a numeric range of 256 values). What we need now is to decide, for each value, which character it represents; an example of this is ASCII (limited to 7 bits, 128 values) with custom extensions that also use the upper 128 values.

That's done, habemus character encoding for 256 symbols (including letters, numbers, analphabetic characters and control codes). Yes each ASCII extension is proprietary but things are clear and easy to manage. Text processing is so common that we just need to add a proper data type in our favorite languages (char in C, note that formally it's not an alias for unsigned char or signed char but a distinct type; char in Pascal; character in FORTRAN and so on) and few library functions to manage that.

Unfortunately it's not so easy. ASCII is limited to a very basic character set and it includes only the Latin characters used in the USA (that's why its preferred name is US-ASCII). It's so limited that even English words with diacritical marks aren't supported (whether this caused the change in the modern language or vice-versa is another story). You'll see it also has other problems (for example its wrong sorting order, with the problems of ordinal versus alphabetic comparison).

How to deal with that? Introduce a new concept: code pages. Keep a fixed set of basic characters (ASCII) and add another 128 characters specific for each language. Value 0x81 will represent Cyrillic character Б (in DOS code page 866) and Greek character Ϊ (in DOS code page 869).
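The ambiguity of code pages is easy to reproduce. The sketch below decodes the same byte with two DOS code pages using Python's built-in codecs; code page 437 is my own choice for the second example and is not one of the pages mentioned above.

```python
raw = bytes([0x81])  # one byte, no intrinsic meaning

print(raw.decode("cp866"))  # Б (Cyrillic)
print(raw.decode("cp437"))  # ü (original IBM PC code page)

# ASCII itself is unaffected: the lower 128 values are shared
print(b"A".decode("cp866") == b"A".decode("cp437"))  # True
```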

Now serious problems arise: 1) You cannot mix different alphabets in the same text file. 2) To properly understand a text you also have to know which code page it's expressed in. Where is that stored? There is no standard method for that, and you'll have to handle it by asking the user or with a reasonable guess (?!). Even nowadays the ZIP file "format" is limited to ASCII for file names (you may use UTF-8, see later, but it's not standard because there is no standard ZIP format). In this post there is a working Java solution. 3) Even code pages are not standard: each environment has different sets (even DOS code pages and Windows code pages are different) and names vary too. 4) 256 characters are still too few for, for example, Chinese or Japanese, so more complicated encodings have been introduced (Shift JIS, for example).

The situation was terrible at that time (~ 1985) and a standard was absolutely needed. ISO/IEC 8859 arrived and it, at least, solved point 3 in the previous problem list. Points 1, 2 and 4 were still unsolved and a solution was needed (especially if your target is not just raw text but also special typography characters). This standard (after many revisions) is still with us nowadays (and it somehow coincides with the Windows-1252 code page) but you probably won't ever use it unless you're working with some legacy system.

The standard which emerged to save us from this chaos is known worldwide: Unicode. From Wikipedia:

Unicode is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems. [...] the latest version of Unicode contains a repertoire of more than 110,000 characters covering 100 scripts and multiple symbol sets.

Languages, libraries and operating systems have been updated to support Unicode. Now we have all the characters we need, a shared well-known code for each, and the past is just a nightmare. Replace char with wchar_t (and accept living with wcout, wstring and friends), just use System.Char or java.lang.Character and live happily. Right?

NO. It's never so easy. Unicode's mission is about "...encoding, representation and handling of text..."; it doesn't translate and adapt different cultures into an abstract code (and it's impossible to do so unless you kill the beauty in the variety of all our languages). Moreover, encoding itself introduces some (not so obvious?!) things we have to care about.