Why can't GCC generate an optimal operator== for a struct of two int32s?

Solution 1:

If you "fix" the alignment, all give the same assembly language output (with GCC):

#include <cstdint>

// Requesting 64-bit alignment lets GCC load and compare the pair as one aligned 64-bit value.
struct alignas(std::int64_t) Point {
    std::int32_t x, y;
};

Demo

As a note, one of the correct/legal ways to do certain things (such as type punning) is to use memcpy, so it seems logical for the compiler to have specific (or more aggressive) optimizations when that function is used.
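
For reference, here is a minimal sketch of the two shapes of comparison the answers refer to. The name naiveEqual comes from the question; memcpyEqual and its exact body are assumptions about what the memcpy-based "optimised" version looks like:

#include <cstdint>
#include <cstring>

struct Point {
    std::int32_t x, y;
};

// Member-wise comparison; GCC compiles this to two 32-bit compares with a branch.
bool naiveEqual(const Point& a, const Point& b) {
    return a.x == b.x && a.y == b.y;
}

// Legal type punning via memcpy: copy both structs into 64-bit integers
// and compare them with a single 64-bit instruction.
bool memcpyEqual(const Point& a, const Point& b) {
    static_assert(sizeof(Point) == sizeof(std::uint64_t), "assumes no padding");
    std::uint64_t lhs, rhs;
    std::memcpy(&lhs, &a, sizeof lhs);
    std::memcpy(&rhs, &b, sizeof rhs);
    return lhs == rhs;
}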

Solution 2:

There's a performance cliff you risk falling off of when implementing this as a single 64-bit comparison:

You break store to load forwarding.

If the 32-bit numbers in the structs are written to memory by separate store instructions, and then loaded back from memory with a 64-bit load instruction quickly (before the stores hit L1$), your execution will stall until the stores commit to the globally visible, cache-coherent L1$. If the loads are 32-bit loads that match the previous 32-bit stores, modern CPUs avoid the store-load stall by forwarding the stored value to the load instruction before the store reaches the cache.

This forwarding violates sequential consistency if multiple CPUs access the memory (a CPU sees its own stores in a different order than other CPUs do), but it is allowed by most modern CPU architectures, even x86. The forwarding also allows much more code to be executed completely speculatively: because no other CPU can have seen the forwarded store, the code that used the loaded value on this CPU can be executed speculatively and rolled back if necessary.

If you want this to use 64-bit operations and you don't want this perf cliff, you may want to ensure the struct is also always written as a single 64-bit number.
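
As a minimal sketch of that idea (assuming the same two-int32 Point; storePoint is a hypothetical helper, not something from the question), the write can be routed through one 64-bit value with memcpy:

#include <cstdint>
#include <cstring>

struct Point {
    std::int32_t x, y;
};

// Copy both members through one 64-bit value so the compiler can emit a
// single 64-bit store; a later 64-bit load can then be store-forwarded
// instead of stalling behind two separate 32-bit stores.
void storePoint(Point* dst, std::int32_t x, std::int32_t y) {
    static_assert(sizeof(Point) == sizeof(std::uint64_t), "assumes no padding");
    Point tmp{x, y};
    std::uint64_t bits;
    std::memcpy(&bits, &tmp, sizeof bits);
    std::memcpy(dst, &bits, sizeof bits);
}

Whether the compiler actually emits a single 64-bit store here depends on the target and optimisation level, so it is worth checking the generated assembly.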

Solution 3:

Why can't the compiler generate [same assembly as memcpy version]?

The compiler "could" in the sense that it would be allowed to.

The compiler simply doesn't. Why it doesn't is beyond my knowledge, as that requires deep knowledge of how the optimiser has been implemented. But the answer may range from "there is no logic covering such a transformation" to "the rules aren't tuned to assume one output is faster than the other on all target CPUs".

If you use Clang instead of GCC, you'll notice that it produces the same output for naiveEqual and naiveEqual1, and that the assembly has no jump. It is the same as for the "optimised" version, except that it uses two 32-bit instructions in place of one 64-bit instruction. Furthermore, restricting the alignment of Point as shown in Jarod42's answer has no effect on the optimiser.

MSVC behaves like Clang in the sense that it is unaffected by the alignment, but differs in that it doesn't get rid of the jump in naiveEqual.

For what it's worth, the compilers (I checked GCC and Clang) produce essentially the same output for the C++20 defaulted comparison as they do for naiveEqual. For whatever reason, GCC opted to use jne instead of je for the jump.
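
For context, the C++20 defaulted comparison discussed here is simply a defaulted operator== on the same struct:

#include <cstdint>

struct Point {
    std::int32_t x, y;

    // C++20 defaulted member-wise equality; per the observation above,
    // GCC and Clang compile it essentially like naiveEqual, with GCC
    // using jne rather than je for the jump.
    bool operator==(const Point&) const = default;
};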

is this a missing compiler optimization

With the assumption that one is always faster than the other on the target CPUs, that would be a fair conclusion.