How do I organize members in a struct to waste the least space on alignment?
Solution 1:
(Don't apply these rules without thinking. See ESR's point about cache locality for members you use together. And in multi-threaded programs, beware false sharing of members written by different threads. Generally you don't want per-thread data in a single struct at all for this reason, unless you're doing it to control the separation with a large alignas(128)
. This applies to atomic
and non-atomic vars; what matters is threads writing to cache lines regardless of how they do it.)
Rule of thumb: largest to smallest alignof()
. There's nothing you can do that's perfect everywhere, but by far the most common case these days is a sane "normal" C++ implementation for a normal 32 or 64-bit CPU. All primitive types have power-of-2 sizes.
Most types have alignof(T) = sizeof(T)
, or alignof(T)
capped at the register width of the implementation. So larger types are usually more-aligned than smaller types.
Struct-packing rules in most ABIs give struct members their absolute alignof(T)
alignment relative to the start of the struct, and the struct itself inherits the largest alignof()
of any of its members.
-
Put always-64-bit members first (like
double
,long long
, andint64_t
). ISO C++ of course doesn't fix these types at 64 bits / 8 bytes, but in practice on all CPUs you care about they are. People porting your code to exotic CPUs can tweak struct layouts to optimize if necessary. -
then pointers and pointer-width integers:
size_t
,intptr_t
, andptrdiff_t
(which may be 32 or 64-bit). These are all the same width on normal modern C++ implementations for CPUs with a flat memory model.Consider putting linked-list and tree left/right pointers first if you care about x86 and Intel CPUs. Pointer-chasing through nodes in a tree or linked list has penalties when the struct start address is in a different 4k page than the member you're accessing. Putting them first guarantees that can't be the case.
-
then
long
(which is sometimes 32-bit even when pointers are 64-bit, in LLP64 ABIs like Windows x64). But it's guaranteed at least as wide asint
. -
then 32-bit
int32_t
,int
,float
,enum
. (Optionally separateint32_t
andfloat
ahead ofint
if you care about possible 8 / 16-bit systems that still pad those types to 32-bit, or do better with them naturally aligned. Most such systems don't have wider loads (FPU or SIMD) so wider types have to be handled as multiple separate chunks all the time anyway).ISO C++ allows
int
to be as narrow as 16 bits, or arbitrarily wide, but in practice it's a 32-bit type even on 64-bit CPUs. ABI designers found that programs designed to work with 32-bitint
just waste memory (and cache footprint) ifint
was wider. Don't make assumptions that would cause correctness problems, but for "portable performance" you just have to be right in the normal case.People tuning your code for exotic platforms can tweak if necessary. If a certain struct layout is perf-critical, perhaps comment on your assumptions and reasoning in the header.
-
then
short
/int16_t
-
then
char
/int8_t
/bool
-
(for multiple
bool
flags, especially if read-mostly or if they're all modified together, consider packing them with 1-bit bitfields.)
(For unsigned integer types, find the corresponding signed type in my list.)
A multiple-of-8 byte array of narrower types can go earlier if you want it to. But if you don't know the exact sizes of types, you can't guarantee that int i
+ char buf[4]
will fill an 8-byte aligned slot between two double
s. But it's not a bad assumption, so I'd do it anyway if there was some reason (like spatial locality of members accessed together) for putting them together instead of at the end.
Exotic types: x86-64 System V has alignof(long double) = 16
, but i386 System V has only alignof(long double) = 4
, sizeof(long double) = 12
. It's the x87 80-bit type, which is actually 10 bytes but padded to 12 or 16 so it's a multiple of its alignof, making arrays possible without violating the alignment guarantee.
And in general it gets trickier when your struct members themselves are aggregates (struct or union) with a sizeof(x) != alignof(x)
.
Another twist is that in some ABIs (e.g. 32-bit Windows if I recall correctly) struct members are aligned to their size (up to 8 bytes) relative to the start of the struct, even though alignof(T)
is still only 4 for double
and int64_t
.
This is to optimize for the common case of separate allocation of 8-byte aligned memory for a single struct, without giving an alignment guarantee. i386 System V also has the same alignof(T) = 4
for most primitive types (but malloc
still gives you 8-byte aligned memory because alignof(maxalign_t) = 8
). But anyway, i386 System V doesn't have that struct-packing rule, so (if you don't arrange your struct from largest to smallest) you can end up with 8-byte members under-aligned relative to the start of the struct.
Most CPUs have addressing modes that, given a pointer in a register, allow access to any byte offset. The max offset is usually very large, but on x86 it saves code size if the byte offset fits in a signed byte ([-128 .. +127]
). So if you have a large array of any kind, prefer putting it later in the struct after the frequently used members. Even if this costs a bit of padding.
Your compiler will pretty much always make code that has the struct address in a register, not some address in the middle of the struct to take advantage of short negative displacements.
Eric S. Raymond wrote an article The Lost Art of Structure Packing. Specifically the section on Structure reordering is basically an answer to this question.
He also makes another important point:
9. Readability and cache locality
While reordering by size is the simplest way to eliminate slop, it’s not necessarily the right thing. There are two more issues: readability and cache locality.
In a large struct that can easily be split across a cache-line boundary, it makes sense to put 2 things nearby if they're always used together. Or even contiguous to allow load/store coalescing, e.g. copying 8 or 16 bytes with one (unaliged) integer or SIMD load/store instead of separately loading smaller members.
Cache lines are typically 32 or 64 bytes on modern CPUs. (On modern x86, always 64 bytes. And Sandybridge-family has an adjacent-line spatial prefetcher in L2 cache that tries to complete 128-byte pairs of lines, separate from the main L2 streamer HW prefetch pattern detector and L1d prefetching).
Fun fact: Rust allows the compiler to reorder structs for better packing, or other reasons. IDK if any compilers actually do that, though. Probably only possible with link-time whole-program optimization if you want the choice to be based on how the struct is actually used. Otherwise separately-compiled parts of the program couldn't agree on a layout.
(@alexis posted a link-only answer linking to ESR's article, so thanks for that starting point.)
Solution 2:
gcc has the -Wpadded
warning that warns when padding is added to a structure:
https://godbolt.org/z/iwO5Q3:
<source>:4:12: warning: padding struct to align 'X::b' [-Wpadded]
4 | double b;
| ^
<source>:1:8: warning: padding struct size to alignment boundary [-Wpadded]
1 | struct X
| ^
And you can manually rearrange members so that there is less / no padding. But this is not a cross platform solution, as different types can have different sizes / alignments on different system (Most notably pointers being 4 or 8 bytes on different architectures). The general rule of thumb is go from largest to smallest alignment when declaring members, and if you're still worried, compile your code with -Wpadded
once (But I wouldn't keep it on generally, because padding is necessary sometimes).
As for the reason why the compiler can't do it automatically is because of the standard ([class.mem]/19). It guarantees that, because this is a simple struct with only public members, &x.a < &x.c
(for some X x;
), so they can't be rearranged.
Solution 3:
There really isn't a portable solution in the generic case. Baring minimal requirements the standard imposes, types can be any size the implementation wants to make them.
To go along with that, the compiler is not allowed to reorder class member to make it more efficient. The standard mandates that the objects must be laid out in their declared order (by access modifier), so that's out as well.
You can use fixed width types like
struct foo
{
int64_t a;
int16_t b;
int8_t c;
int8_t d;
};
and this will be the same on all platforms, provided they supply those types, but it only works with integer types. There are no fixed-width floating point types and many standard objects/containers can be different sizes on different platforms.
Solution 4:
Mate, in case you have 3GB of data, you probably should approach an issue by other way then swapping data members.
Instead of using 'array of struct', 'struct of arrays' could be used. So say
struct X
{
int a;
double b;
int c;
};
constexpr size_t ArraySize = 1'000'000;
X my_data[ArraySize];
is going to became
constexpr size_t ArraySize = 1'000'000;
struct X
{
int a[ArraySize];
double b[ArraySize];
int c[ArraySize];
};
X my_data;
Each element is still easily accessible mydata.a[i] = 5; mydata.b[i] = 1.5f;...
.
There is no paddings (except a few bytes between arrays). Memory layout is cache friendly. Prefetcher handles reading sequential memory blocks from a few separate memory regions.
That's not as unorthodox as it might looks at first glance. That approach is widely used for SIMD and GPU programming.
Array of Structures (AoS), Structure of Arrays