Is garbage allowed in high bits of parameter and return value registers in x86-64 SysV ABI?
The x86-64 SysV ABI specifies, among other things, how function parameters are passed in registers (first argument in rdi, then rsi and so on), and how integer return values are passed back (in rax and then rdx for really big values).
What I can't find, however, is what the high bits of parameter or return value registers should be when passing types smaller than 64-bits.
For example, for the following function:
void foo(unsigned x, unsigned y);
... x will be passed in rdi and y in rsi, but they are only 32 bits wide. Do the high 32 bits of rdi and rsi need to be zero? Intuitively, I would assume yes, but the code generated by all of gcc, clang and icc has specific mov instructions at the start to zero out the high bits, so it seems like the compilers assume otherwise.
Similarly, the compilers seem to assume that the high bits of the return value register rax may contain garbage if the return value is smaller than 64 bits. For example, the loops in the following code:
unsigned gives32();
unsigned short gives16();
long sum32_64() {
    long total = 0;
    for (int i=1000; i--; ) {
        total += gives32();
    }
    return total;
}
long sum16_64() {
    long total = 0;
    for (int i=1000; i--; ) {
        total += gives16();
    }
    return total;
}
... compile to the following in clang (and other compilers are similar):
sum32_64():
    ...
.LBB0_1:
    call gives32()
    mov eax, eax
    add rbx, rax
    inc ebp
    jne .LBB0_1
sum16_64():
    ...
.LBB1_1:
    call gives16()
    movzx eax, ax
    add rbx, rax
    inc ebp
    jne .LBB1_1
Note the mov eax, eax after the call returning 32 bits, and the movzx eax, ax after the 16-bit call: both have the effect of zeroing out the top 32 or 48 bits, respectively. So this behavior has some cost; the same loop dealing with a 64-bit return value omits this instruction.
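To make the effect of those two instructions concrete, here is a minimal C sketch that models a 64-bit register whose upper bits may hold garbage. The helper names zext32 and zext16 are hypothetical and only illustrate the semantics; they are not something the compilers or the ABI define.

```c
#include <stdint.h>

/* Hypothetical helpers: `reg` models a full 64-bit register such as rax. */

/* Same effect as `mov eax, eax`: keep the low 32 bits, zero the upper 32. */
uint64_t zext32(uint64_t reg) {
    return (uint32_t)reg;
}

/* Same effect as `movzx eax, ax`: keep the low 16 bits, zero the upper 48. */
uint64_t zext16(uint64_t reg) {
    return (uint16_t)reg;
}
```

Whatever the callee left in the upper bits, the caller sees only the narrow value after the extension, which is why the loops above can safely add rax to a 64-bit total.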
I've read the x86-64 System V ABI document pretty carefully, but I couldn't find whether this behavior is documented in the standard.
What are the benefits of such a decision? It seems to me there are clear costs:
Parameter Costs
Costs are imposed on the implementation of the callee when it deals with parameter values. Granted, often this cost is zero because the function can effectively ignore the high bits, or the zeroing comes for free since 32-bit operand size instructions can be used which implicitly zero the high bits.
However, costs are often very real in the cases of functions that accept 32-bit arguments and do some math that could benefit from 64-bit math. Take this function for example:
uint32_t average(uint32_t a, uint32_t b) {
    return ((uint64_t)a + b) >> 1;
}
This is a straightforward use of 64-bit math to calculate a function that would otherwise have to carefully deal with overflow (the ability to transform many 32-bit functions in this way is an often unnoticed benefit of 64-bit architectures). This compiles to:
average(unsigned int, unsigned int):
    mov edi, edi
    mov eax, esi
    add rax, rdi
    shr rax, 1
    ret
Fully 2 out of the 4 instructions (ignoring ret) are needed just to zero out the high bits. This may be cheap in practice with mov-elimination, but still it seems a big cost to pay.
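For comparison, here is a hedged sketch of the widening idea next to a 32-bit-only alternative, both computing the floor of the mean (a shift by one). The names average_wide and average_narrow are made up for this illustration, and the bit trick in the narrow version is just one common way to avoid the overflow, not necessarily what a compiler would pick.

```c
#include <stdint.h>

/* Widened version: the sum needs at most 33 bits, so it cannot overflow
   in a 64-bit intermediate. */
uint32_t average_wide(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a + b) >> 1);
}

/* 32-bit-only version: the bits common to both inputs, plus half of the
   bits that differ. No intermediate exceeds 32 bits. */
uint32_t average_narrow(uint32_t a, uint32_t b) {
    return (a & b) + ((a ^ b) >> 1);
}
```

The naive 32-bit form (a + b) >> 1 wraps around for large inputs, which is exactly the case the 64-bit version handles without any extra care in the source.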
On the other hand, I can't really see a similar corresponding cost for the callers if the ABI were to specify that high bits are zero. Because rdi and rsi and the other parameter passing registers are scratch (i.e., can be overwritten by the callee), you only have a couple of scenarios (we look at rdi, but replace it with the parameter reg of your choice):
- The value passed to the function in rdi is dead (not needed) in the post-call code. In that case, whatever instruction last assigned to rdi simply has to assign to edi instead. Not only is this free, it is often one byte smaller if you avoid a REX prefix.
- The value passed to the function in rdi is needed after the function. In that case, since rdi is caller-saved, the caller needs to do a mov of the value to a callee-saved register anyway. You can generally organize it so that the value starts in the callee-saved register (say rbx) and then is moved to edi like mov edi, ebx, so it costs nothing.
I can't see many scenarios where the zeroing costs the caller much. One example would be if 64-bit math is needed in the last instruction which assigned rdi. That seems quite rare though.
Return value costs
Here the decision seems more neutral. Having callees clear out the junk has a definite cost (you sometimes see mov eax, eax instructions to do this), but if garbage is allowed the cost shifts to the caller. Overall, it seems more likely that the caller can clear the junk for free, so allowing garbage doesn't seem overall detrimental to performance.
I suppose one interesting use-case for this behavior is that functions with varying sizes can share an identical implementation. For example, all of the following functions:
short sums(short x, short y) {
    return x + y;
}
int sumi(int x, int y) {
    return x + y;
}
long suml(long x, long y) {
    return x + y;
}
Can actually share the same implementation [1]:
sum:
    lea rax, [rdi+rsi]
    ret
[1] Whether such folding is actually allowed for functions that have their address taken is very much open to debate.
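The reason one 64-bit add can serve all three widths is that carries only propagate upward, so garbage in the high bits of the inputs never reaches the low bits of the result. The following sketch models this in C; sum_regs and with_garbage are hypothetical helpers, and unsigned types are used (an assumption made here for the illustration) so that wraparound is well defined in C.

```c
#include <stdint.h>

/* Models `lea rax, [rdi+rsi]`: a full 64-bit add of whatever the two
   argument registers contain. */
uint64_t sum_regs(uint64_t rdi, uint64_t rsi) {
    return rdi + rsi;
}

/* Hypothetical helper: put the low `bits` bits of `value` into a register
   image whose remaining high bits come from `garbage`. `bits` must be < 64. */
uint64_t with_garbage(uint64_t garbage, uint64_t value, unsigned bits) {
    uint64_t mask = (UINT64_C(1) << bits) - 1;
    return (garbage & ~mask) | (value & mask);
}
```

Truncating the result of sum_regs to 16 or 32 bits gives the same answer as doing the addition at that width, regardless of what the high bits of either input held.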
It looks like you have two questions here:
- Do the high bits of a return value need to be zeroed before returning? (And do the high bits of arguments need to be zeroed before calling?)
- What are the costs/benefits associated with this decision?
The answer to the first question is no, there can be garbage in the high bits, and Peter Cordes has already written a very nice answer on the subject.
As for the second question, I suspect that leaving the high bits undefined is overall better for performance. On one hand, zero-extending values beforehand comes at no additional cost when 32-bit operations are used. But on the other hand, zeroing the high bits beforehand is not always necessary. If you allow garbage in the high bits, then you can leave it up to the code that receives the values to only perform zero-extensions (or sign-extensions) when they are actually required.
But I wanted to highlight another consideration: Security
Information leaks
When the upper bits of a result are not cleared, they may retain fragments of other pieces of information, such as function pointers or addresses in the stack/heap. If there ever exists a mechanism to execute higher-privileged functions and retrieve the full value of rax (or eax) afterwards, then this could introduce an information leak. For example, a system call might leak a pointer from the kernel to user space, leading to a defeat of kernel ASLR. Or an IPC mechanism might leak information about another process' address space that could assist in developing a sandbox breakout.
Of course, one might argue that it is not the responsibility of the ABI to prevent information leaks; it is up to the programmer to implement their code correctly. While I do agree, mandating that the compiler zero the upper bits would still have the effect of eliminating this particular form of an information leak.
You shouldn't trust your input
On the other side of things, and more importantly, the compiler should not blindly trust that any received values have their upper bits zeroed out, or else the function may not behave as expected, and this could also lead to exploitable conditions. For example, consider the following:
unsigned char buf[256];
...
void write_index(unsigned char index, unsigned char value) {
    buf[index] = value;
}
If we were allowed to assume that index has its upper bits zeroed out, then we could compile the above as:
write_index: ;; dil = index, sil = value
    ; movzx edi, dil ; skipped based on assumptions
    mov [buf + rdi], sil
    ret
But if we could call this function from our own code, we could supply a value of rdi outside the [0,255] range and write to memory beyond the bounds of the buffer.
Of course, the compiler would not actually generate code like this, since, as mentioned above, it is the responsibility of the callee to zero- or sign-extend its arguments, rather than that of the caller. This, I think, is a very practical reason to have the code that receives a value always assume that there is garbage in the upper bits and explicitly remove it.
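As a rough illustration of that defensive behavior, the sketch below models the incoming argument register as a 64-bit integer and truncates it before use, the way a movzx would. The names demo_buf, write_index_checked and raw_index_out_of_bounds are invented for this example and do not correspond to anything a compiler emits.

```c
#include <stdint.h>

unsigned char demo_buf[256];

/* Hypothetical model: `index_reg` stands for the full 64-bit register that
   carries the 8-bit index, possibly with garbage above bit 7. */
void write_index_checked(uint64_t index_reg, unsigned char value) {
    unsigned char index = (unsigned char)index_reg; /* like movzx: low 8 bits only */
    demo_buf[index] = value;                        /* always within [0,255] */
}

/* Returns nonzero if using the raw register as the index would have
   addressed memory outside demo_buf. */
int raw_index_out_of_bounds(uint64_t index_reg) {
    return index_reg >= sizeof demo_buf;
}
```

With the truncation in place, a register value such as 0xABCD000000000005 writes to element 5 rather than far outside the buffer.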
(For Intel IvyBridge and later (mov-elimination), compilers would hopefully zero-extend into a different register to at least avoid the latency, if not the front-end throughput cost, of a movzx instruction.)