Why doesn't GCC use partial registers?
Solution 1:
Yes, GCC generally avoids writing to partial registers, unless optimizing for size (-Os) instead of purely speed (-O3). Some cases require writing at least the 32-bit register for correctness, so a better example would be something like:
char foo(char *p) { return *p; }
which compiles to movzx eax, byte ptr [rdi] instead of mov al, [rdi]. https://godbolt.org/z/4ca9cTG9j
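As a rough sketch of the same idea with explicit signedness (the helper names load_u8 and load_s8 are made up for illustration, and the instructions named in the comments are what GCC typically emits for x86-64 at -O2/-O3, not a guarantee):

```c
#include <assert.h>

/* Hypothetical helpers, not from the original post. */

/* Zero-extending byte load: GCC typically emits
   movzx eax, byte ptr [rdi] rather than mov al, [rdi]. */
unsigned load_u8(const unsigned char *p) { return *p; }

/* Sign-extending byte load: GCC typically emits
   movsx eax, byte ptr [rdi]. */
int load_s8(const signed char *p) { return *p; }
```

Both versions write the full 32-bit register, so neither leaves the result dependent on whatever was in eax before.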
But GCC doesn't always avoid partial registers, sometimes even causing partial-register stalls https://gcc.gnu.org/bugzilla/show_bug.cgi?id=15533
Writing partial registers entails a performance penalty on many x86 processors because they are renamed into different physical registers from their whole counterpart when written. (For more about register renaming enabling out-of-order execution, see this Q&A).
But when an instruction reads the whole register, the CPU has to detect the fact that it doesn't have the correct architectural register value available in a single physical register. (This happens in the issue/rename stage, as the CPU prepares to send the uop into the out-of-order scheduler.)
It's called a partial register stall. Agner Fog's microarchitecture manual explains it pretty well:
6.8 Partial register stalls (PPro/PII/PIII and early Pentium-M)
Partial register stall is a problem that occurs when we write to part of a 32-bit register and later read from the whole register or a bigger part of it.
Example:
; Example 6.10a. Partial register stall
mov al, byte ptr [mem8]
mov ebx, eax ; Partial register stall
This gives a delay of 5 - 6 clocks. The reason is that a temporary register has been assigned to AL to make it independent of AH. The execution unit has to wait until the write to AL has retired before it is possible to combine the value from AL with the value of the rest of EAX.
Behaviour in different CPUs:
- Intel early P6 family: see above: stall for 5-6 clocks until the partial writes retire.
- Intel Pentium-M (model D) / Core2 / Nehalem: stall for 2-3 cycles while inserting a merging uop. (see this Q&A for a microbenchmark writing AX and reading EAX with or without xor-zeroing first)
- Intel Sandybridge: insert a merging uop for low8/low16 (AL/AX) without stalling, or for AH/BH/CH/DH while stalling for 1 cycle.
- Intel IvyBridge (maybe), but definitely Haswell / Skylake: AL/AX aren't renamed, but AH still is. See "How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent."
- All other x86 CPUs: Intel Pentium4, Atom / Silvermont / Knight's Landing. All AMD (and Via, etc):
Partial registers are never renamed. Writing a partial register merges into the full register, making the write depend on the old value of the full register as an input.
Without partial-register renaming, the input dependency for the write is a false dependency if you never read the full register. This limits instruction-level parallelism because reusing an 8 or 16-bit register for something else is not actually independent from the CPU's point of view (16-bit code can access 32-bit registers, so it has to maintain correct values in the upper halves). It also makes AL and AH not independent. When Intel designed P6-family (PPro released in 1995), 16-bit code was still common, so partial-register renaming was an important feature to make existing machine code run faster. (In practice, many binaries don't get recompiled for new CPUs.)
That's why compilers mostly avoid writing partial registers. They use movzx / movsx whenever possible to zero- or sign-extend narrow values to a full register to avoid partial-register false dependencies (AMD) or stalls (Intel P6-family). Thus most modern machine code doesn't benefit much from partial-register renaming, which is why recent Intel CPUs are simplifying their partial-register renaming logic.
As @BeeOnRope's answer points out, compilers still read partial registers, because that's not a problem. (Reading AH/BH/CH/DH can add an extra cycle of latency on Haswell/Skylake, though, see the earlier link about partial registers on recent members of Sandybridge-family.)
Also note that write takes arguments that, for a typically configured x86-64 GCC, need whole 32-bit and 64-bit registers, so the call couldn't simply be assembled with mov dl, 3. The size is determined by the type of the data, not the value of the data.
Only 32-bit register writes implicitly zero-extend to the full 64-bit register; writing an 8-bit or 16-bit partial register leaves the upper bytes unchanged. (This makes it tricky for hardware to handle efficiently, which is why AMD64 didn't follow that pattern.)
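The write rules can be sketched as a small C model. This is only an illustration of the architectural result, not of how any CPU implements it, and the function names write8, write16 and write32 are invented here:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model of x86-64 general-purpose register writes.
   All names are hypothetical. */

/* 8-bit write (e.g. mov dl, imm8): merges into the old 64-bit value. */
uint64_t write8(uint64_t old, uint8_t v) {
    return (old & ~UINT64_C(0xFF)) | v;
}

/* 16-bit write (e.g. mov dx, imm16): also merges into the old value. */
uint64_t write16(uint64_t old, uint16_t v) {
    return (old & ~UINT64_C(0xFFFF)) | v;
}

/* 32-bit write (e.g. mov edx, imm32): zero-extends to 64 bits,
   so the old value is not an input at all. */
uint64_t write32(uint32_t v) {
    return v;
}
```

Note that write8 and write16 need the old register value as a parameter while write32 does not; that extra input is the dependency discussed above. Starting from garbage, write8(garbage, 3) keeps the garbage in bits 8 to 63, whereas write8(write32(0), 3) yields exactly 3.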
Finally, in certain contexts, C has default argument promotions to be aware of, though that is not the case here.
Actually, as RossRidge pointed out, the call was probably made without a visible prototype.
Your disassembly is misleading, as @Jester pointed out.
For example, mov rdx, 3 is actually mov edx, 3, although both have the same effect: putting 3 in the whole of rdx.
This is true because an immediate value of 3 doesn't require sign-extension, and a MOV r32, imm32 implicitly clears the upper 32 bits of the register.
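To see why the two forms agree for 3 but not for every immediate, here is a minimal C model of the two encodings. The function names are hypothetical; the 64-bit form modeled is the one that takes a sign-extended 32-bit immediate:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical models of the resulting 64-bit register value. */

/* mov r32, imm32: the 32-bit result is zero-extended to 64 bits. */
uint64_t mov_r32_imm32(uint32_t imm) {
    return (uint64_t)imm;
}

/* mov r64, imm32 (REX.W form): the immediate is sign-extended to 64 bits. */
uint64_t mov_r64_imm32(int32_t imm) {
    return (uint64_t)(int64_t)imm;
}
```

For any immediate with bit 31 clear, such as 3, both functions return the same value, which is why an assembler or disassembler can treat the two spellings as interchangeable. For an immediate like -1 they differ: the 32-bit form gives 0x00000000FFFFFFFF and the 64-bit form gives 0xFFFFFFFFFFFFFFFF.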
Solution 2:
All three of the earlier answers are wrong in different ways.
The accepted answer by Margaret Bloom implies that partial register stalls are to blame. Partial register stalls are a real thing, but are unlikely to be relevant to GCC's decision here.
If GCC replaced mov edx,3 by mov dl,3, then the code would just be wrong, because writes to byte registers (unlike writes to dword registers) don't zero the rest of the register. The parameter in rdx is of type size_t, which is 64 bits, so the callee will read the full register, which will contain garbage in bits 8 to 63. Partial register stalls are purely a performance issue; it doesn't matter how fast the code runs if it's wrong.
That bug could be fixed by inserting xor edx,edx before mov dl,3. With that fix, there is no partial register stall, because zeroing a full register with xor or sub and then writing to the low byte is special-cased in all CPUs that have the stalling problem. So partial register stalls are still irrelevant with the fix.
The only situation where partial register stalls would become relevant is if GCC happened to know that the register was zero, but it wasn't zeroed by one of the special-cased instructions. For example, if this syscall was preceded by
loop:
...
dec edx
jnz loop
then GCC could deduce that rdx was zero at the point where it wants to put 3 in it, and mov dl,3 would be correct, but it would be a bad idea in general because it could cause a partial-register stall. (Here, it wouldn't matter because syscalls are so slow anyway, but I don't think GCC has a "slow function that there's no need to speed-optimize calls to" attribute in its internal type system.)
Why doesn't GCC emit xor followed by a byte move, if not because of partial register stalls? I don't know, but I can speculate.
It only saves space when initializing r0 through r3, and even then it only saves one byte. It increases the number of instructions, which has its own costs (the instruction decoders are frequently a bottleneck). It also clobbers the flags, unlike the standard mov, which means it isn't a drop-in replacement. GCC would have to track a separate flag-clobbering register initialization sequence, which in most cases (11/15 of possible destination registers) would be unambiguously less efficient.
If you're aggressively optimizing for size, you can do push 3 followed by pop rdx, which saves 2 bytes regardless of the destination register, and doesn't clobber the flags. But it is probably much slower because it writes to memory and has a false read-write dependence on rsp, and the space savings seem unlikely to be worth it. (It also modifies the red zone, so it isn't a drop-in replacement either.)
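The byte counts in the last two paragraphs can be checked by writing out the x86-64 machine code. The arrays below are a sketch with made-up names; rdx stands in for the r0 through r3 group, esi for a register whose low byte needs a REX prefix, and r8 for the extended registers:

```c
#include <assert.h>
#include <stddef.h>

/* x86-64 encodings of each way to load the constant 3.
   Array names are hypothetical. */
const unsigned char mov_edx_3[]      = {0xBA, 0x03, 0x00, 0x00, 0x00};       /* mov edx, 3 */
const unsigned char xor_mov_dl_3[]   = {0x31, 0xD2, 0xB2, 0x03};             /* xor edx, edx ; mov dl, 3 */
const unsigned char push_pop_rdx_3[] = {0x6A, 0x03, 0x5A};                   /* push 3 ; pop rdx */

const unsigned char mov_esi_3[]      = {0xBE, 0x03, 0x00, 0x00, 0x00};       /* mov esi, 3 */
const unsigned char xor_mov_sil_3[]  = {0x31, 0xF6, 0x40, 0xB6, 0x03};       /* xor esi, esi ; mov sil, 3 (REX prefix) */

const unsigned char mov_r8d_3[]      = {0x41, 0xB8, 0x03, 0x00, 0x00, 0x00}; /* mov r8d, 3 */
const unsigned char push_pop_r8_3[]  = {0x6A, 0x03, 0x41, 0x58};             /* push 3 ; pop r8 */
```

Comparing sizeof for each array gives 5 versus 4 bytes for the xor sequence on rdx, no saving at all on esi, and a 2-byte saving for push/pop on both rdx and r8.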
supercat's answer says
Processor cores often include logic to execute multiple 32-bit or 64-bit instructions simultaneously, but may not include logic to execute an 8-bit operation simultaneously with anything else. Consequently, while using 8-bit operations on the 8088 when possible was a useful optimization on the 8088, it can actually be a significant performance drain on newer processors.
Modern optimizing compilers actually use 8-bit GPRs quite a lot. (They use 16-bit GPRs relatively rarely, but I think that's because 16-bit quantities are uncommon in modern code.) 8-bit and 16-bit operations are at least as fast as 32-bit and 64-bit operations at most execution stages, and some are faster.
I previously wrote here "As far as I know, 8-bit operations are as fast as, or faster than, 32/64-bit operations on absolutely every 32/64 bit x86/x64 processor ever made." But I was wrong. Quite a few superscalar x86/x64 processors merge 8- and 16-bit destinations into the full register on every write, which means that write-only instructions like mov have a false read dependency when the destination is 8/16 bits which doesn't exist when it's 32/64 bits. False dependency chains can slow execution if you don't clear the register before every move (or during, using something like movzx). Newer processors have this problem even though the earliest superscalar processors (Pentium Pro/II/III) didn't have it. In spite of that, modern optimizing compilers do use the smaller registers in my experience.
BeeOnRope's answer says
The short answer for your particular case, is because gcc always sign or zero-extends arguments to 32-bits when calling a C ABI function.
But this function has no parameters shorter than 32 bits in the first place. File descriptors are exactly 32 bits long, and size_t is exactly 64 bits long. It doesn't matter that many of those bits are often zero. They aren't variable-length integers that are encoded in 1 byte if they're small. It would only be correct to use mov dl,3, with the rest of rdx possibly being nonzero, for a parameter if there was no integer promotion requirement in the ABI and the actual parameter type was char or some other 8-bit type.
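The point that width follows the type rather than the value can be checked directly in C. This is a tiny sketch; count_width is a hypothetical stand-in for any function that takes a size_t parameter, as write does for its third argument:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a function with a size_t parameter. */
size_t count_width(size_t count) {
    /* sizeof depends on the declared type, never on the run-time value. */
    return sizeof count;
}
```

count_width(3) and count_width((size_t)-1) return the same number, so the caller has to materialize the whole size_t either way.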