char vs wchar_t vs char16_t vs char32_t (C++11)
From what I understand, a char is safe to house ASCII characters, whereas char16_t and char32_t are safe to house characters from Unicode, one for the 16-bit variety and another for the 32-bit variety (should I have said "a" instead of "the"?). But I'm then left wondering what the purpose behind wchar_t is. Should I ever use that type in new code, or is it simply there to support old code? What was the purpose of wchar_t in old code if, from what I understand, its size had no guarantee to be bigger than a char? Clarification would be nice!
Solution 1:
char is for 8-bit code units, char16_t is for 16-bit code units, and char32_t is for 32-bit code units. Any of these can be used for 'Unicode'; UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units, and UTF-32 uses 32-bit code units.
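For instance, here is a minimal C++11 sketch that counts how many code units the same character occupies in each encoding (the counts shown assume the character é, U+00E9):

#include <iostream>
#include <string>

int main() {
    std::string    s8  = u8"\u00e9";   // é as UTF-8: two 8-bit code units
    std::u16string s16 = u"\u00e9";    // é as UTF-16: one 16-bit code unit
    std::u32string s32 = U"\u00e9";    // é as UTF-32: one 32-bit code unit
    std::cout << s8.size() << ' ' << s16.size() << ' ' << s32.size() << '\n';  // prints: 2 1 1
}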
The guarantee made for wchar_t was that any character supported in a locale could be converted from char to wchar_t, and whatever representation was used for char, be it multiple bytes, shift codes, what have you, the wchar_t would be a single, distinct value. The purpose of this was that you could then manipulate wchar_t strings with the same simple algorithms used for ASCII.
For example, converting ASCII to upper case goes like this:
auto loc = std::locale("");
char s[] = "hello";
for (char &c : s) {
c = toupper(c, loc);
}
But this won't handle converting all characters in UTF-8 to uppercase, or all characters in some other encoding like Shift-JIS. People wanted to be able to internationalize this code like so:
auto loc = std::locale("");
wchar_t s[] = L"hello";
for (wchar_t &c : s) {
c = toupper(c, loc);
}
So every wchar_t is a 'character', and if it has an uppercase version it can be converted directly. Unfortunately this doesn't really work all the time; for example, some languages have oddities such as the German letter ß, whose uppercase version is actually the two characters SS rather than a single character.
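A rough sketch of why (the exact result is locale-dependent, but the signature alone rules out a one-to-two mapping):

#include <locale>

int main() {
    std::locale loc("");
    wchar_t eszett = L'\u00df';                  // ß (U+00DF)
    // toupper returns a single wchar_t, so it cannot produce the two
    // characters "SS"; in typical locales ß simply comes back unchanged.
    wchar_t upper = std::toupper(eszett, loc);
    return upper == eszett ? 0 : 1;              // usually 0
}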
So internationalized text handling is intrinsically harder than ASCII and cannot really be simplified in the way the designers of wchar_t intended. As such, wchar_t and wide characters in general provide little value.
The only reason to use them is that they've been baked into some APIs and platforms. However, I prefer to stick to UTF-8 in my own code even when developing on such platforms, and to just convert at the API boundaries to whatever encoding is required.
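For example, on Windows, where the wide APIs expect UTF-16, the kind of boundary helper I have in mind looks roughly like this (widen is just an illustrative name, not a library function; error handling omitted):

#include <windows.h>
#include <string>

// Convert UTF-8 (used internally) to the UTF-16 std::wstring
// that the Windows "W" APIs expect.
std::wstring widen(const std::string &utf8) {
    if (utf8.empty()) return std::wstring();
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), nullptr, 0);
    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), &wide[0], len);
    return wide;
}

Call sites then look like CreateFileW(widen(path).c_str(), ...), with a matching helper around WideCharToMultiByte for data coming back out of the API.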
Solution 2:
The type wchar_t was put into the standard when Unicode promised to create a 16-bit representation. Most vendors chose to make wchar_t 32 bits, but one large vendor chose to make it 16 bits. Since Unicode code points need more than 16 bits (they go up to U+10FFFF, which takes 21 bits), it was felt that we should have better character types.
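A small sketch makes the platform difference visible (typical results, not guaranteed by the standard: 4-byte wchar_t on Linux and macOS, 2-byte wchar_t on Windows):

#include <iostream>

int main() {
    // char16_t and char32_t are 2 and 4 bytes in practice everywhere,
    // while wchar_t varies by platform.
    std::cout << "wchar_t:  " << sizeof(wchar_t)  << '\n'
              << "char16_t: " << sizeof(char16_t) << '\n'
              << "char32_t: " << sizeof(char32_t) << '\n';
}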
The intent for char16_t is to represent UTF-16, and char32_t is meant to directly represent Unicode characters (code points). However, on systems using wchar_t as part of their fundamental interface, you'll be stuck with wchar_t. If you are unconstrained, I would personally use char to represent Unicode using UTF-8. The problem with char16_t and char32_t is that they are not fully supported, not even in the standard C++ library: for example, there are no streams supporting these types directly, and getting one to work is more work than just instantiating the stream template for these types.
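For example, even just printing a std::u16string already takes a manual conversion step; one C++11 option is std::wstring_convert (later deprecated in C++17), roughly like this:

#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

int main() {
    std::u16string s = u"hello";
    // There is no ready-made std::basic_ostream<char16_t>, so convert the
    // UTF-16 string to UTF-8 before handing it to std::cout.
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    std::cout << conv.to_bytes(s) << '\n';
}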