reinterpret_cast between char* and std::uint8_t* - safe?
Now we all sometimes have to work with binary data. In C++ we work with sequences of bytes, and since the beginning char
was our building block. Defined to have a sizeof
of 1, it is the byte. And all library I/O functions use char
by default. All is good but there was always a little concern, a little oddity that bugged some people - the number of bits in a byte is implementation-defined.
So in C99, it was decided to introduce several typedefs to let the developers easily express themselves, the fixed-width integer types. Optional, of course, since we never want to hurt portability. Among them, uint8_t
, migrated into C++11 as std::uint8_t
, a fixed width 8-bit unsigned integer type, was the perfect choice for people who really wanted to work with 8 bit bytes.
And so, developers embraced the new tools and started building libraries that explicitly state that they accept 8-bit byte sequences, as std::uint8_t*
, std::vector<std::uint8_t>
or otherwise.
But, perhaps after very deep thought, the standardization committee decided not to require an implementation of std::char_traits<std::uint8_t>
, thereby preventing developers from easily and portably instantiating, say, std::basic_fstream<std::uint8_t>
and easily reading std::uint8_t
values as binary data. Or maybe some of us don't care about the number of bits in a byte and are happy with it.
But unfortunately, the two worlds collide, and sometimes you have to take data as char*
and pass it to a library that expects std::uint8_t*
. But wait, you say, isn't char
of variable bit width, while std::uint8_t
is fixed at 8 bits? Will it result in a loss of data?
Well, there is some interesting Standardese on this. The type char
is defined to hold exactly one byte, and a byte is the smallest addressable chunk of memory, so there can't be a type with a bit width less than that of char
. Next, it is defined to be able to hold UTF-8 code units. This gives us the minimum - 8 bits. So now we have a typedef which is required to be 8 bits wide and a type that is at least 8 bits wide. But are there alternatives? Yes, unsigned char
. Remember that signedness of char
is implementation-defined. Any other type? Thankfully, no. All other integral types have required ranges which fall outside of 8 bits.
Finally, std::uint8_t
is optional, which means that a library which uses this type will not compile if it's not defined. But what if it compiles? I can say with a great degree of confidence that this means that we are on a platform with 8-bit bytes and CHAR_BIT == 8
.
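That reasoning can be written down as a compile-time check. The following is only a sketch: the constant name kBytesAreOctets is hypothetical, and the assertions restate the argument above rather than prove anything new.

```cpp
#include <climits>
#include <cstdint>

// Naming std::uint8_t at all fails to compile on platforms that lack it.
// It has exactly 8 bits and no padding, and no object can be smaller than
// one char, so its existence should force CHAR_BIT == 8.
static_assert(CHAR_BIT == 8,
              "std::uint8_t exists, so bytes should be 8 bits wide");
static_assert(sizeof(std::uint8_t) == 1,
              "std::uint8_t should occupy exactly one byte");

// Hypothetical constant (not part of any library) that other code could query.
constexpr bool kBytesAreOctets =
    (CHAR_BIT == 8) && (sizeof(std::uint8_t) == 1);
```

If either assertion ever fired, the confidence expressed above would be misplaced on that platform.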
Once we have this knowledge, that we have 8-bit bytes, that std::uint8_t
is implemented as either char
or unsigned char
, can we assume that we can do reinterpret_cast
from char*
to std::uint8_t*
and vice versa? Is it portable?
This is where my Standardese reading skills fail me. I read about safely derived pointers ([basic.stc.dynamic.safety]
) and, as far as I understand, the following:
std::uint8_t* buffer = /* ... */ ;
char* buffer2 = reinterpret_cast<char*>(buffer);
std::uint8_t* buffer3 = reinterpret_cast<std::uint8_t*>(buffer2);
is safe if we don't touch buffer2
. Correct me if I'm wrong.
So, given the following preconditions:
- CHAR_BIT == 8
- std::uint8_t is defined.
Is it portable and safe to cast char*
and std::uint8_t*
back and forth, assuming that we're working with binary data and the implementation-defined signedness of char
doesn't matter?
I would appreciate references to the Standard with explanations.
EDIT: Thanks, Jerry Coffin. I'm going to add the quote from the Standard ([basic.lval], §3.10/10):
If a program attempts to access the stored value of an object through a glvalue of other than one of the following types the behavior is undefined:
...
— a char or unsigned char type.
EDIT2: Ok, going deeper. std::uint8_t
is not guaranteed to be a typedef of unsigned char
. It can be implemented as an extended unsigned integer type, and extended unsigned integer types are not included in §3.10/10. What now?
Ok, let's get truly pedantic. After reading this, this and this, I'm pretty confident that I understand the intention behind both Standards.
So, doing reinterpret_cast
from std::uint8_t*
to char*
and then dereferencing the resulting pointer is safe and portable and is explicitly permitted by [basic.lval].
However, doing reinterpret_cast
from char*
to std::uint8_t*
and then dereferencing the resulting pointer is a violation of the strict aliasing rule and is undefined behavior if std::uint8_t
is implemented as an extended unsigned integer type.
There are two possible workarounds. First:
static_assert(std::is_same_v<std::uint8_t, char> ||
std::is_same_v<std::uint8_t, unsigned char>,
"This library requires std::uint8_t to be implemented as char or unsigned char.");
With this assert in place, your code will not compile on platforms on which it would result in undefined behavior otherwise.
Second:
std::memcpy(uint8buffer, charbuffer, size);
Cppreference says that std::memcpy
accesses objects as arrays of unsigned char
so it is safe and portable.
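A slightly fuller sketch of the second workaround follows. The function name to_uint8 is hypothetical, and the copy costs an extra buffer compared to a cast.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: copy a char buffer into freshly allocated
// std::uint8_t storage instead of aliasing it through a cast.
std::vector<std::uint8_t> to_uint8(const char* data, std::size_t size) {
    std::vector<std::uint8_t> out(size);
    // std::memcpy requires valid pointers even for a zero-length copy,
    // so the empty case is skipped.
    if (size != 0) {
        std::memcpy(out.data(), data, size);
    }
    return out;
}
```

The result can then be handed to a library expecting std::uint8_t* via out.data(), without ever dereferencing a std::uint8_t* that points into char storage.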
To reiterate, in order to be able to reinterpret_cast
between char*
and std::uint8_t*
and work with resulting pointers portably and safely in a 100% standard-conforming way, the following conditions must be true:
- CHAR_BIT == 8.
- std::uint8_t is defined.
- std::uint8_t is implemented as char or unsigned char.
On a practical note, the above conditions are true on 99% of platforms and there is likely no platform on which the first 2 conditions are true while the 3rd one is false.
If uint8_t
exists at all, essentially the only choice is that it's a typedef for unsigned char
(or char
if it happens to be unsigned). Nothing (but a bitfield) can represent less storage than a char
, and the only other type that can be as small as 8 bits is a bool
. The next smallest normal integer type is a short
, which must be at least 16 bits.
As such, if uint8_t
exists at all, you really only have two possibilities: you're either casting unsigned char
to unsigned char
, or casting signed char
to unsigned char
.
The former is an identity conversion, so obviously safe. The latter falls within the "special dispensation" given for accessing any other type as a sequence of char or unsigned char in §3.10/10, so it also gives defined behavior.
Since that includes both char
and unsigned char
, a cast to access it as a sequence of char also gives defined behavior.
Edit: As far as Luc's mention of extended integer types goes, I'm not sure how you'd manage to apply it to get a difference in this case. C++ refers to the C99 standard for the definitions of uint8_t
and such, so the quotes throughout the remainder of this come from C99.
§6.2.6.1/3 specifies that unsigned char
shall use a pure binary representation, with no padding bits. Padding bits are only allowed in 6.2.6.2/1, which specifically excludes unsigned char
. That section, however, describes a pure binary representation in detail -- literally to the bit. Therefore, unsigned char
and uint8_t
(if it exists) must be represented identically at the bit level.
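One rough way to observe this on a given implementation is to copy every unsigned char value into a uint8_t byte for byte and compare. This is only an empirical sketch with a hypothetical function name. It checks one platform and does not prove anything about the Standard.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical check: every 8-bit pattern stored as unsigned char should
// read back as the same value when its bytes are copied into a std::uint8_t.
bool representations_agree() {
    static_assert(sizeof(unsigned char) == sizeof(std::uint8_t),
                  "both types should occupy exactly one byte");
    for (unsigned value = 0; value <= 0xFFu; ++value) {
        const unsigned char as_uchar = static_cast<unsigned char>(value);
        std::uint8_t as_uint8 = 0;
        // Copy the object representation without converting the value.
        std::memcpy(&as_uint8, &as_uchar, sizeof as_uint8);
        if (static_cast<unsigned>(as_uint8) != value) {
            return false;
        }
    }
    return true;
}
```

On an implementation where the two types interpret the same bits identically, the function returns true.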
To see a difference between the two, we have to assert that some particular bits when viewed as one would produce results different from when viewed as the other -- despite the fact that the two must have identical representations at the bit level.
To put it more directly: a difference in result between the two requires that they interpret bits differently -- despite a direct requirement that they interpret bits identically.
Even on a purely theoretical level, this appears difficult to achieve. On anything approaching a practical level, it's obviously ridiculous.