Is garbage allowed in high bits of parameter and return value registers in x86-64 SysV ABI?

The x86-64 SysV ABI specifies, among other things, how function parameters are passed in registers (first argument in rdi, then rsi and so on), and how integer return values are passed back (in rax and then rdx for really big values).

What I can't find, however, is what the high bits of parameter or return value registers should be when passing types smaller than 64 bits.

For example, for the following function:

void foo(unsigned x, unsigned y);

... x will be passed in rdi and y in rsi, but they are only 32 bits wide. Do the high 32 bits of rdi and rsi need to be zero? Intuitively, I would assume yes, but the code generated by all of gcc, clang and icc has specific mov instructions at the start to zero out the high bits, so it seems like the compilers assume otherwise.

Similarly, the compilers seem to assume that the high bits of the return value rax may have garbage bits if the return value is smaller than 64 bits. For example, the loops in the following code:

unsigned gives32();
unsigned short gives16();

long sum32_64() {
  long total = 0;
  for (int i=1000; i--; ) {
    total += gives32();
  }
  return total;
}

long sum16_64() {
  long total = 0;
  for (int i=1000; i--; ) {
    total += gives16();
  }
  return total;
}

... compile to the following in clang (and other compilers are similar):

sum32_64():
...
.LBB0_1:                               
    call    gives32()
    mov     eax, eax
    add     rbx, rax
    inc     ebp
    jne     .LBB0_1


sum16_64():
...
.LBB1_1:
    call    gives16()
    movzx   eax, ax
    add     rbx, rax
    inc     ebp
    jne     .LBB1_1

Note the mov eax, eax after the call returning 32 bits, and the movzx eax, ax after the 16-bit call - both have the effect of zeroing out the top 32 or 48 bits, respectively. So this behavior has some cost - the same loop dealing with a 64-bit return value omits this instruction.
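As a rough model of what those two instructions do, here is a minimal C sketch. The names widen_u32 and widen_u16 are hypothetical, and raw_rax just stands in for a 64-bit register whose upper bits may hold garbage; this is an illustration, not what any compiler literally emits.

```c
#include <stdint.h>

/* Hypothetical helper: models a caller consuming a 32-bit result
   when the upper half of the 64-bit register is not guaranteed. */
uint64_t widen_u32(uint64_t raw_rax) {
    uint32_t low = (uint32_t)raw_rax;  /* keep only the low 32 bits, like `mov eax, eax` */
    return (uint64_t)low;              /* zero-extend back to 64 bits */
}

/* Same idea for a 16-bit result, like `movzx eax, ax`. */
uint64_t widen_u16(uint64_t raw_rax) {
    uint16_t low = (uint16_t)raw_rax;  /* keep only the low 16 bits */
    return (uint64_t)low;              /* zero-extend back to 64 bits */
}
```

Whatever sits in the upper bits of the input is discarded before the value is used in 64-bit arithmetic.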

I've read the x86-64 System V ABI document pretty carefully, but I couldn't find whether this behavior is documented in the standard.

What are the benefits of such a decision? It seems to me there are clear costs:

Parameter Costs

Costs are imposed on the implementation of the callee when dealing with parameter values. Granted, often this cost is zero because the function can effectively ignore the high bits, or the zeroing comes for free since 32-bit operand size instructions can be used which implicitly zero the high bits.

However, costs are often very real in the cases of functions that accept 32-bit arguments and do some math that could benefit from 64-bit math. Take this function for example:

uint32_t average(uint32_t a, uint32_t b) {
  return ((uint64_t)a + b) >> 1;
}

This is a straightforward use of 64-bit math to calculate a function that would otherwise have to carefully deal with overflow (the ability to transform many 32-bit functions in this way is an often unnoticed benefit of 64-bit architectures). This compiles to:

average(unsigned int, unsigned int):
        mov     edi, edi
        mov     eax, esi
        add     rax, rdi
        shr     rax, 1
        ret

Fully 2 out of the 4 instructions (ignoring ret) are needed just to zero out the high bits. This may be cheap in practice with mov-elimination, but still it seems a big cost to pay.
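For comparison, here is a sketch of what "carefully dealing with overflow" can look like when only 32-bit math is used for a two-value average. The names average_wide and average_narrow are hypothetical, and the narrow version is just one common formulation, not the only one.

```c
#include <stdint.h>

/* Widening version: the 64-bit sum of two 32-bit values cannot overflow. */
uint32_t average_wide(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a + b) >> 1);
}

/* 32-bit-only version: halve each operand first, then add back the
   carry that is lost when both low bits are set. */
uint32_t average_narrow(uint32_t a, uint32_t b) {
    return (a >> 1) + (b >> 1) + (a & b & 1u);
}
```

Both give the same result, but the narrow version needs more operations, which is why the widening form is attractive when the zero-extension is cheap.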

On the other hand, I can't really see a similar corresponding cost for the callers if the ABI were to specify that high bits are zero. Because rdi and rsi and the other parameter passing registers are scratch (i.e., can be overwritten by the callee), you only have a couple of scenarios (we look at rdi, but replace it with the parameter reg of your choice):

  1. The value passed to the function in rdi is dead (not needed) in the post-call code. In that case, whatever instruction last assigned to rdi simply has to assign to edi instead. Not only is this free, it is often one byte smaller if you avoid a REX prefix.

  2. The value passed to the function in rdi is needed after the function. In that case, since rdi is caller-saved, the caller needs to do a mov of the value to a callee-saved register anyway. You can generally organize it so that the value starts in the callee-saved register (say rbx) and then is moved to edi like mov edi, ebx, so it costs nothing.

I can't see many scenarios where the zeroing costs the caller much. Some examples would be if 64-bit math is needed in the last instruction which assigned rdi. That seems quite rare though.

Return value costs

Here the decision seems more neutral. Having callees clear out the junk has a definite cost (you sometimes see mov eax, eax instructions to do this), but if garbage is allowed the cost shifts to the caller. Overall, it seems more likely that the caller can clear the junk for free, so allowing garbage doesn't seem detrimental to performance overall.

I suppose one interesting use-case for this behavior is that functions with varying sizes can share an identical implementation. For example, all of the following functions:

short sums(short x, short y) {
  return x + y;
}

int sumi(int x, int y) {
  return x + y;
}

long suml(long x, long y) {
  return x + y;
}

... can actually share the same implementation (see footnote 1):

sum:
        lea     rax, [rdi+rsi]
        ret
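To see why one body can serve all three, here is a minimal C sketch: the low N bits of a sum depend only on the low N bits of the inputs, so garbage above them never reaches the bits the caller reads. The names sum_regs, as_short and as_int are hypothetical, and the narrowing casts assume the usual two's-complement wrapping that gcc and clang implement.

```c
#include <stdint.h>

/* Stand-in for the shared `lea rax, [rdi+rsi]`: a full 64-bit add
   on whatever the two registers happen to hold. */
uint64_t sum_regs(uint64_t rdi, uint64_t rsi) {
    return rdi + rsi;
}

/* Each "caller" reads back only the bits its return type defines. */
int16_t as_short(uint64_t rax) { return (int16_t)(uint16_t)rax; }
int32_t as_int(uint64_t rax)   { return (int32_t)(uint32_t)rax; }
```

For instance, passing registers whose low 16 bits encode -2 and 5, with arbitrary junk above, still yields 3 when the result is read as a short.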

1 Whether such folding is actually allowed for functions that have their address taken is very much open to debate.


It looks like you have two questions here:

  1. Do the high bits of a return value need to be zeroed before returning? (And do the high bits of arguments need to be zeroed before calling?)
  2. What are the costs/benefits associated with this decision?

The answer to the first question is no, there can be garbage in the high bits, and Peter Cordes has already written a very nice answer on the subject.

As for the second question, I suspect that leaving the high bits undefined is overall better for performance. On one hand, zero-extending values beforehand comes at no additional cost when 32-bit operations are used. But on the other hand, zeroing the high bits beforehand is not always necessary. If you allow garbage in the high bits, then you can leave it up to the code that receives the values to only perform zero-extensions (or sign-extensions) when they are actually required.

But I wanted to highlight another consideration: Security

Information leaks

When the upper bits of a result are not cleared, they may retain fragments of other pieces of information, such as function pointers or addresses in the stack/heap. If there ever exists a mechanism to execute higher-privileged functions and retrieve the full value of rax (or eax) afterwards, then this could introduce an information leak. For example, a system call might leak a pointer from the kernel to user space, leading to a defeat of kernel ASLR. Or an IPC mechanism might leak information about another process' address space that could assist in developing a sandbox breakout.

Of course, one might argue that it is not the responsibility of the ABI to prevent information leaks; it is up to the programmer to implement their code correctly. While I do agree, mandating that the compiler zero the upper bits would still have the effect of eliminating this particular form of an information leak.

You shouldn't trust your input

On the other side of things, and more importantly, the compiler should not blindly trust that any received values have their upper bits zeroed out, or else the function may not behave as expected, and this could also lead to exploitable conditions. For example, consider the following:

unsigned char buf[256];
...
__fastcall void write_index(unsigned char index, unsigned char value) {
    buf[index] = value;
}

If we were allowed to assume that index has its upper bits zeroed out, then we could compile the above as:

write_index:  ;; dil = index, sil = value
      ; movzx edi, dil       ; skipped based on assumptions
    mov [buf + rdi], sil
    ret

But if we could call this function from our own code, we could supply a value of rdi out of the [0,255] range and write to memory beyond the bounds of the buffer.

Of course, the compiler would not actually generate code like this, since, as mentioned above, it is the responsibility of the callee to zero- or sign-extend its arguments, rather than that of the caller. This, I think, is a very practical reason to have the code that receives a value always assume that there is garbage in the upper bits and explicitly remove it.
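The defensive behavior can be modeled in C roughly as follows. This is only a sketch: write_index_raw is a hypothetical name, and its 64-bit parameters stand in for full argument registers whose upper bits are not guaranteed.

```c
#include <stddef.h>
#include <stdint.h>

unsigned char buf[256];

/* Hypothetical model of the callee: treat both arguments as raw 64-bit
   registers and discard everything above the low 8 bits before use. */
size_t write_index_raw(uint64_t raw_index, uint64_t raw_value) {
    size_t index = (unsigned char)raw_index;  /* like movzx: keep the low 8 bits */
    buf[index] = (unsigned char)raw_value;    /* store only the low byte */
    return index;                             /* always in [0, 255] */
}
```

Because the index is narrowed before it is used as an offset, no register contents supplied by the caller can push the store outside buf.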

(For Intel Ivy Bridge and later, which have mov-elimination, compilers would hopefully zero-extend into a different register to at least avoid the latency, if not the front-end throughput cost, of a movzx instruction.)