Is C++20 'char8_t' the same as our old 'char'?
In the CPP reference documentation, I noticed for char:
The character types are large enough to represent any UTF-8 eight-bit code unit (since C++14)
and for char8_t:
type for UTF-8 character representation, required to be large enough to represent any UTF-8 code unit (8 bits)
Does that mean both are the same type? Or does char8_t have some other feature?
Disclaimer: I'm the author of the char8_t proposals P0482 and P1423.
In C++20, char8_t is a distinct type from all other types. In the related proposal for C, N2653, char8_t is a typedef of unsigned char, similar to the existing typedefs for char16_t and char32_t.
In C++20, char8_t has an underlying representation that matches unsigned char. It therefore has the same size (at least 8 bits, but possibly larger), alignment, and integer conversion rank as unsigned char, but has different aliasing rules.
In particular, char8_t was not added to the list of types at [basic.lval]p11, [basic.life]p6.4, [basic.types]p2, or [basic.types]p4. This means that, unlike unsigned char, it cannot be used for the underlying storage of objects of another type, nor can it be used to examine the underlying representation of objects of other types; in other words, it cannot be used to alias other types. A consequence of this is that objects of type char8_t can be accessed via pointers to char or unsigned char, but pointers to char8_t cannot be used to access char or unsigned char data. In other words:
reinterpret_cast<const char *>(u8"text"); // Ok: char may alias char8_t.
reinterpret_cast<const char8_t*>("text"); // Undefined behavior once the result is dereferenced.
The motivation for a distinct type with these properties is:
- To provide a distinct type for UTF-8 character data vs character data with an encoding that is either locale dependent or that requires separate specification.
- To enable overloading for ordinary string literals vs UTF-8 string literals (since they may have different encodings).
- To ensure an unsigned type for UTF-8 data (whether char is signed or unsigned is implementation-defined).
- To enable better performance via a non-aliasing type; optimizers can better optimize types that do not alias other types.
char8_t is not the same as char. It does, however, have the same underlying representation as unsigned char, per [basic.fundamental]/9:
Type char8_t denotes a distinct type whose underlying type is unsigned char. Types char16_t and char32_t denote distinct types whose underlying types are uint_least16_t and uint_least32_t, respectively, in <cstdint>.
emphasis mine
Do note that since the standard calls it a distinct type, code like
std::cout << std::is_same_v<unsigned char, char8_t>;
will print 0 (false), even though char8_t is implemented as an unsigned char. This is because it is not an alias, but a distinct type.
Another thing to note is that char has the same representation as either signed char or unsigned char, depending on the implementation. That means it is possible for char to have the same range and representation as char8_t, but they are still separate types. char, signed char, unsigned char, and char8_t are the same size, but they are all distinct types.