Why isn't there an endianness modifier in C++ like there is for signedness?
(I guess this question could apply to many typed languages, but I chose to use C++ as an example.)
Why is there no way to just write:
struct foo {
little int x; // little-endian
big long int y; // big-endian
short z; // native endianness
};
to specify the endianness for specific members, variables and parameters?
Comparison to signedness
I understand that the type of a variable not only determines how many bytes are used to store a value but also how those bytes are interpreted when performing computations.
For example, these two declarations each allocate one byte, and for both bytes, every possible 8-bit sequence is a valid value:
signed char s;
unsigned char u;
but the same binary sequence might be interpreted differently, e.g. 11111111
would mean -1 when assigned to s
but 255 when assigned to u
. When signed and unsigned variables are involved in the same computation, the compiler (mostly) takes care of proper conversions.
In my understanding, endianness is just a variation of the same principle: a different interpretation of a binary pattern based on compile-time information about the memory in which it will be stored.
It seems obvious to have that feature in a typed language that allows low-level programming. However, this is not a part of C, C++ or any other language I know, and I did not find any discussion about this online.
Update
I'll try to summarize some takeaways from the many comments that I got in the first hour after asking:
- signedness is strictly binary (either signed or unsigned) and will always be, in contrast to endianness, which also has two well-known variants (big and little), but also lesser-known variants such as mixed/middle endian. New variants might be invented in the future.
- endianness matters when accessing multiple-byte values byte-wise. There are many aspects beyond just endianness that affect the memory layout of multi-byte structures, so this kind of access is mostly discouraged.
- C++ aims to target an abstract machine and minimize the number of assumptions about the implementation. This abstract machine does not have any endianness.
Also, now I realize that signedness and endianness are not a perfect analogy, because:
- endianness only defines how something is represented as a binary sequence, but now what can be represented. Both
big int
andlittle int
would have the exact same value range. - signedness defines how bits and actual values map to each other, but also affects what can be represented, e.g. -3 can't be represented by an
unsigned char
and (assuming thatchar
has 8 bits) 130 can't be represented by asigned char
.
So that changing the endianness of some variables would never change the behavior of the program (except for byte-wise access), whereas a change of signedness usually would.
What the standard says
[intro.abstract]/1
:The semantic descriptions in this document define a parameterized nondeterministic abstract machine. This document places no requirement on the structure of conforming implementations. In particular, they need not copy or emulate the structure of the abstract machine. Rather, conforming implementations are required to emulate (only) the observable behavior of the abstract machine as explained below.
C++ could not define an endianness qualifier since it has no concept of endianness.
Discussion
About the difference between signness and endianness, OP wrote
In my understanding, endianness is just a variation of the same principle [(signness)]: a different interpretation of a binary pattern based on compile-time information about the memory in which it will be stored.
I'd argue signness both have a semantic and a representative aspect1. What [intro.abstract]/1
implies is that C++ only care about semantic, and never addresses the way a signed number should be represented in memory2. Actually, "sign bit" only appears once in the C++ specs and refer to an implementation-defined value.
On the other hand, endianness only have a representative aspect: endianness conveys no meaning.
With C++20, std::endian
appears. It is still implementation-defined, but let us test the endian of the host without depending on old tricks based on undefined behaviour.
1) Semantic aspect: an signed integer can represent values below zero; representative aspect: one need to, for example, reserve a bit to convey the positive/negative sign.
2) In the same vein, C++ never describe how a floating point number should be represented, IEEE-754 is often used, but this is a choice made by the implementation, in any case enforced by the standard: [basic.fundamental]/8
"The value representation of floating-point types is implementation-defined".
In addition to YSC's answer, let's take your sample code, and consider what it might aim to achieve
struct foo {
little int x; // little-endian
big long int y; // big-endian
short z; // native endianness
};
You might hope that this would exactly specify layout for architecture-independent data interchange (file, network, whatever)
But this can't possibly work, because several things are still unspecified:
- data type size: you'd have to use
little int32_t
,big int64_t
andint16_t
respectively, if that's what you want - padding and alignment, which cannot be controlled strictly within the language: use
#pragma
or__attribute__((packed))
or some other compiler-specific extension - actual format (1s- or 2s-complement signedness, floating-point type layout, trap representations)
Alternatively, you might simply want to reflect the endianness of some specified hardware - but big
and little
don't cover all the possibilities here (just the two most common).
So, the proposal is incomplete (it doesn't distinguish all reasonable byte-ordering arrangements), ineffective (it doesn't achieve what it sets out to), and has additional drawbacks:
-
Performance
Changing the endianness of a variable from the native byte ordering should either disable arithmetic, comparisons etc (since the hardware cannot correctly perform them on this type), or must silently inject more code, creating natively-ordered temporaries to work on.
The argument here isn't that manually converting to/from native byte order is faster, it's that controlling it explicitly makes it easier to minimise the number of unnecessary conversions, and much easier to reason about how code will behave, than if the conversions are implicit.
-
Complexity
Everything overloaded or specialized for integer types now needs twice as many versions, to cope with the rare event that it gets passed a non-native-endianness value. Even if that's just a forwarding wrapper (with a couple of casts to translate to/from native ordering), it's still a lot of code for no discernible benefit.
The final argument against changing the language to support this is that you can easily do it in code. Changing the language syntax is a big deal, and doesn't offer any obvious benefit over something like a type wrapper:
// store T with reversed byte order
template <typename T>
class Reversed {
T val_;
static T reverse(T); // platform-specific implementation
public:
explicit Reversed(T t) : val_(reverse(t)) {}
Reversed(Reversed const &other) : val_(other.val_) {}
// assignment, move, arithmetic, comparison etc. etc.
operator T () const { return reverse(val_); }
};
Integers (as a mathematical concept) have the concept of positive and negative numbers. This abstract concept of sign has a number of different implementations in hardware.
Endianness is not a mathematical concept. Little-endian is a hardware implementation trick to improve the performance of multi-byte twos-complement integer arithmetic on a microprocessor with 16 or 32 bit registers and an 8-bit memory bus. Its creation required using the term big-endian to describe everything else that had the same byte-order in registers and in memory.
The C abstract machine includes the concept of signed and unsigned integers, without details -- without requiring twos-complement arithmetic, 8-bit bytes or how to store a binary number in memory.
PS: I agree that binary data compatibility on the net or in memory/storage is a PIA.
That's a good question and I have often thought something like this would be useful. However you need to remember that C aims for platform independence and endianness is only important when a structure like this is converted into some underlying memory layout. This conversion can happen when you cast a uint8_t buffer into an int for example. While an endianness modifier looks neat the programmer still needs to consider other platform differences such as int sizes and structure alignment and packing. For defensive programming when you want find grain control over how some variables or structures are represented in a memory buffer then it is best to code explicit conversion functions and then let the compiler optimiser generate the most efficient code for each supported platform.
Endianness is not inherently a part of a data type but rather of its storage layout.
As such, it would not be really akin to signed/unsigned but rather more like bit field widths in structs. Similar to those, they could be used for defining binary APIs.
So you'd have something like
int ip : big 32;
which would define both storage layout and integer size, leaving it to the compiler to do the best job of matching use of the field to its access. It's not obvious to me what the allowed declarations should be.