What's "wrong" with C++ wchar_t and wstrings? What are some alternatives to wide characters?

I have seen a lot of people in the C++ community (particularly ##c++ on freenode) resent the use of wstrings and wchar_t, and their use in the Windows API. What exactly is "wrong" with wchar_t and wstring, and if I want to support internationalization, what are some alternatives to wide characters?

Solution 1:

What is wchar_t?

wchar_t is defined such that any locale's char encoding can be converted to a wchar_t representation where every wchar_t represents exactly one codepoint:

Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales (22.3.1).

— C++ [basic.fundamental] 3.9.1/5

This does not require that wchar_t be large enough to represent any character from all locales simultaneously. That is, the encoding used for wchar_t may differ between locales. Which means that you cannot necessarily convert a string to wchar_t using one locale and then convert back to char using another locale.¹

Since using wchar_t as a common representation between all locales seems to be the primary use for wchar_t in practice, you might wonder what it's good for if not that.

The original intent and purpose of wchar_t was to make text processing simple by defining it such that it requires a one-to-one mapping from a string's code-units to the text's characters, thus allowing the use of the same simple algorithms as are used with ascii strings to work with other languages.

Unfortunately, the wording of wchar_t's specification assumes a one-to-one mapping between characters and codepoints to achieve this. Unicode breaks that assumption², so you can't safely use wchar_t for simple text algorithms either.
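To make the broken assumption concrete, here is a minimal sketch using char32_t strings, so that every code unit is a full codepoint. The function names are made up for illustration. It shows the same visible character, "é", spelled with one codepoint and with two, and what a naive per-code-unit reverse does to the two-codepoint spelling:

```cpp
#include <cassert>
#include <string>

// Two spellings of the same user-perceived character "é".
// Precomposed: U+00E9. Decomposed: U+0065 (e) + U+0301 (combining acute).
inline std::u32string precomposed_e_acute() { return U"\u00E9"; }
inline std::u32string decomposed_e_acute() { return U"e\u0301"; }

// The kind of "simple" algorithm wide characters were meant to enable:
// reverse a string one code unit at a time.
inline std::u32string naive_reverse(const std::u32string& s) {
    return std::u32string(s.rbegin(), s.rend());
}

// Usage: the facts a per-code-unit algorithm gets wrong.
inline void check_combining_example() {
    // One visible character, but the "lengths" disagree.
    assert(precomposed_e_acute().size() == 1);
    assert(decomposed_e_acute().size() == 2);
    // Reversing "a" + decomposed "é" detaches the accent from its base letter.
    assert(naive_reverse(U"ae\u0301") == U"\u0301ea");
}
```

Even with a fixed-width 32-bit encoding, counting or reversing code units is not the same as counting or reversing characters.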

This means that portable software cannot use wchar_t either as a common representation for text between locales, or to enable the use of simple text algorithms.

What use is wchar_t today?

Not much, for portable code anyway. If __STDC_ISO_10646__ is defined, then values of wchar_t directly represent Unicode codepoints with the same values in all locales. That makes it safe to do the inter-locale conversions mentioned earlier. However, you can't rely on it alone to decide that you can use wchar_t this way because, while most Unix platforms define it, Windows does not, even though Windows uses the same wchar_t encoding in all locales.
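A rough sketch of how code might query this at compile time. The two helper names are hypothetical; only the macro itself comes from the standard:

```cpp
#include <climits>
#include <cstddef>

// True when the implementation promises that wchar_t values are
// ISO 10646 (Unicode) codepoints with the same meaning in every locale.
constexpr bool wchar_is_unicode_codepoint() {
#ifdef __STDC_ISO_10646__
    return true;
#else
    return false;
#endif
}

// Width of wchar_t in bits on this platform.
constexpr std::size_t wchar_bits() { return sizeof(wchar_t) * CHAR_BIT; }

// If the macro is defined, wchar_t must be able to hold any codepoint
// up to U+10FFFF, which needs at least 21 bits.
static_assert(!wchar_is_unicode_codepoint() || wchar_bits() >= 21,
              "__STDC_ISO_10646__ implies wchar_t can hold every codepoint");
```

A false result does not by itself say what wchar_t holds. On Windows it is false even though the encoding (UTF-16) is the same in every locale.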

The reason Windows doesn't define __STDC_ISO_10646__ is that Windows uses UTF-16 as its wchar_t encoding, and UTF-16 uses surrogate pairs to represent codepoints greater than U+FFFF, so it doesn't satisfy the requirements for __STDC_ISO_10646__.
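For reference, the surrogate pair arithmetic looks roughly like this. It is a sketch that uses char16_t rather than wchar_t so it behaves the same on every platform; encode_utf16 is an illustrative name, not a standard function:

```cpp
#include <cassert>
#include <string>

// Encode one Unicode codepoint as one or two UTF-16 code units.
inline std::u16string encode_utf16(char32_t cp) {
    // Surrogate values and anything past U+10FFFF are not valid input.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::u16string out;
    if (cp < 0x10000) {
        // Fits in a single 16-bit code unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Split the remaining 20 bits across a high and a low surrogate.
        const char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
    }
    return out;
}
```

U+20AC (the euro sign) comes out as one code unit, while U+1F600 comes out as the pair 0xD83D 0xDE00, so a single 16-bit wchar_t cannot represent it.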

For platform specific code wchar_t may be more useful. It's essentially required on Windows (e.g., some files simply cannot be opened without using wchar_t filenames), though Windows is the only platform where this is true as far as I know (so maybe we can think of wchar_t as 'Windows_char_t').

In hindsight wchar_t is clearly not useful for simplifying text handling, or as storage for locale independent text. Portable code should not attempt to use it for these purposes. Non-portable code may find it useful simply because some API requires it.

Alternatives

The alternative I like is to use UTF-8 encoded C strings, even on platforms not particularly friendly toward UTF-8.

This way one can write portable code using a common text representation across platforms, use standard datatypes for their intended purpose, get the language's support for those types (e.g. string literals, though some tricks are necessary to make it work for some compilers), some standard library support, debugger support (more tricks may be necessary), etc. With wide characters it's generally harder or impossible to get all of this, and you may get different pieces on different platforms.

One thing UTF-8 does not provide is the ability to use simple text algorithms such as are possible with ASCII. In this respect UTF-8 is no worse than any other Unicode encoding. In fact, it may be considered better, because multi-code-unit representations are more common in UTF-8, so bugs in code handling such variable-width representations of characters are more likely to be noticed and fixed than if you try to stick to UTF-32 with NFC or NFKC.
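As an illustration of what "simple algorithms" lose with UTF-8, here is a small sketch with hypothetical helper names. It assumes the input is well-formed UTF-8 and counts codepoints, not user-perceived characters:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Number of codepoints in a string assumed to hold well-formed UTF-8.
// Continuation bytes have the form 10xxxxxx and are not counted.
inline std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Byte offset at which the codepoint with the given index starts.
// Unlike s[i] on an ASCII string, this is a linear scan.
inline std::size_t byte_offset_of(const std::string& utf8, std::size_t index) {
    assert(index < count_code_points(utf8));  // caller must pass a valid index
    std::size_t seen = 0;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        if ((static_cast<unsigned char>(utf8[i]) & 0xC0) != 0x80) {
            if (seen == index) return i;
            ++seen;
        }
    }
    return utf8.size();  // not reached when the precondition holds
}
```

For the string "h\xC3\xA9llo" ("héllo"), size() reports 6 bytes but there are 5 codepoints, and the third codepoint starts at byte 3, not byte 2. Code that confuses the two tends to fail on the first accented letter it meets, which is the point made above about such bugs being noticed early.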

Many platforms use UTF-8 as their native char encoding and many programs do not require any significant text processing, so writing an internationalized program on those platforms is little different from writing code without considering internationalization. Writing more widely portable code, or writing on other platforms, requires inserting conversions at the boundaries of APIs that use other encodings.
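A boundary conversion of that kind might look something like the following hand-rolled sketch. utf8_to_utf16 is a hypothetical helper, and the validation is deliberately minimal (it does not reject overlong encodings, for example). On Windows one would more likely call the platform's own converter, MultiByteToWideChar with CP_UTF8:

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Convert UTF-8 to UTF-16 at an API boundary.
// Throws std::invalid_argument on truncated or malformed input.
inline std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t len;
        // The lead byte tells us how many bytes the sequence has.
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");
        // Each continuation byte contributes six more bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("not a valid Unicode scalar value");
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            const char32_t v = cp - 0x10000;
            assert(v <= 0xFFFFF);  // 20 bits, guaranteed by the range check above
            out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
        }
        i += len;
    }
    return out;
}
```

The rest of the program keeps its text in UTF-8 std::string and only produces UTF-16 immediately before calling an API that demands it.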

Another alternative used by some software is to choose a cross-platform representation, such as unsigned short arrays holding UTF-16 data, and then to supply all the library support and simply live with the costs in language support, etc.

C++11 adds new kinds of wide characters as alternatives to wchar_t: char16_t and char32_t, with attendant language/library features. These aren't actually guaranteed to be UTF-16 and UTF-32, but I don't imagine any major implementation will use anything else. C++11 also improves UTF-8 support, for example with UTF-8 string literals, so it won't be necessary to trick VC++ into producing UTF-8 encoded strings (although I may continue to do so rather than use the u8 prefix).
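A quick sketch of the three literal prefixes, checked at compile time. The code_units helper is made up for this example, and the checks assume an implementation that encodes these literals as UTF-8, UTF-16, and UTF-32 respectively, which is what the major compilers do:

```cpp
#include <cstddef>

// Number of code units in a string literal, excluding the terminator.
template <typename T, std::size_t N>
constexpr std::size_t code_units(const T (&)[N]) { return N - 1; }

// U+00E9 (é) needs two UTF-8 code units.
static_assert(code_units(u8"\u00E9") == 2, "two UTF-8 code units");
// U+1F600 needs a surrogate pair in a char16_t literal...
static_assert(code_units(u"\U0001F600") == 2, "two UTF-16 code units");
// ...but a single code unit in a char32_t literal.
static_assert(code_units(U"\U0001F600") == 1, "one UTF-32 code unit");
```

Using sizes rather than assigning the u8 literal to a const char* keeps the sketch valid in later standards too, where the element type of u8 literals changed to char8_t.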

Alternatives to avoid

TCHAR: TCHAR is for migrating ancient Windows programs that assume legacy encodings from char to wchar_t, and is best forgotten unless your program was written in some previous millennium. It's not portable and is inherently unspecific about its encoding and even its data type, making it unusable with any non-TCHAR based API. Since its purpose is migration to wchar_t, which we've seen above isn't a good idea, there is no value whatsoever in using TCHAR.

1. Characters which are representable in wchar_t strings but which are not supported in any locale are not required to be represented with a single wchar_t value. This means that wchar_t could use a variable-width encoding for certain characters, another clear violation of the intent of wchar_t. It's arguable, though, that a character being representable by wchar_t is enough to say that the locale 'supports' that character, in which case variable-width encodings aren't legal and Windows' use of UTF-16 is non-conformant.

2. Unicode allows many characters to be represented with multiple code points, which creates the same problems for simple text algorithms as variable-width encodings. Even if one strictly maintains a composed normalization, some characters still require multiple code points. See: http://www.unicode.org/standard/where/

Solution 2:

There's nothing "wrong" with wchar_t. The problem is that, back in the NT 3.x days, Microsoft decided that Unicode was Good (it is), and to implement Unicode as 16-bit wchar_t characters. So most Microsoft literature from the mid-'90s pretty much equated Unicode == UTF-16 == wchar_t.

Which, sadly, is not at all the case. "Wide characters" are not necessarily 2 bytes, on all platforms, under all circumstances.

This is one of the best primers on "Unicode" (independent of this question, independent of C++) I've ever seen. I highly recommend it:

http://www.joelonsoftware.com/articles/Unicode.html

And I honestly believe the best way to deal with "8-bit ASCII" vs "Win32 wide characters" vs "wchar_t-in-general" is simply to accept that "Windows is Different" ... and code accordingly.

IMHO...

PS:

I totally agree with jamesdlin above:

On Windows, you don't really have a choice. Its internal APIs were designed for UCS-2, which was reasonable at the time since it was before the variable-length UTF-8 and UTF-16 encodings were standardized. But now that they support UTF-16, they've ended up with the worst of both worlds.