Why NASM on Linux changes registers in x86_64 assembly

Solution 1:

In 64-bit mode mov eax, 1 zeroes the upper part of the rax register (writing a 32-bit register implicitly clears bits 63:32), thus mov eax, 1 is semantically equivalent to mov rax, 1.

The former, however, spares a REX.W prefix (48h numerically), the byte necessary to specify the 64-bit operand size introduced with x86-64; the opcode is the same for both instructions (0b8h, followed by a DWORD or a QWORD immediate).
So the assembler goes ahead and picks the shortest form.

This is typical NASM behavior, see Section 3.3 of the NASM manual, where the example [eax*2] is assembled as [eax+eax] to spare the disp32 field after the SIB byte1 ([eax*2] alone is only encodable as [eax*2+disp32], where the assembler sets disp32 to 0).

I was unable to force NASM to emit a real mov rax, 1 instruction (i.e. 48 B8 01 00 00 00 00 00 00 00) even by prefixing the instruction with o64.
If a real mov rax, 1 is needed (this is not your case), one must resort to assembling it manually with db and similar.

EDIT: Peter Cordes' answer shows that there is, in fact, a way to tell NASM not to optimize an instruction with the strict modifier.
mov rax, STRICT 1 produces the 10-byte version of the instruction (mov r64, imm64) while mov rax, STRICT DWORD 1 produces a 7-byte version (mov r64, imm32 where imm32 is sign-extended before use).


Side note: It's better to use RIP-relative addressing; this avoids 64-bit immediate constants (thus reducing code size) and is mandatory on macOS (in case you care).
Change the mov esi, msg to lea esi, [REL msg] (RIP-relative is an addressing mode, so it needs an address expression, i.e. the square brackets; to avoid reading from that address we use lea, which only computes the effective address and performs no memory access).
You can use the directive DEFAULT REL to avoid typing REL in each memory access.
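To make the mechanics concrete, here is a minimal Python sketch (not NASM itself) of how the disp32 in a RIP-relative lea is computed: the displacement is relative to the address of the *next* instruction. The encoding bytes (REX.W 48, opcode 8D, ModRM 35 selecting rsi and RIP-relative rm=101) follow the standard x86-64 format; the addresses used are purely hypothetical.

```python
import struct

def lea_rsi_rip_rel(instr_addr: int, target_addr: int) -> bytes:
    """Encode `lea rsi, [rel target]` as REX.W + 8D /r with
    ModRM mod=00, rm=101 (RIP-relative).  The disp32 is measured
    from the end of this 7-byte instruction, not its start."""
    length = 7                      # 48 8D 35 + 4-byte disp32
    disp = target_addr - (instr_addr + length)
    return bytes([0x48, 0x8D, 0x35]) + struct.pack("<i", disp)

# msg placed 0x100 bytes past the instruction (hypothetical addresses):
print(lea_rsi_rip_rel(0x401000, 0x401100).hex(" "))
# -> 48 8d 35 f9 00 00 00   (disp32 = 0x100 - 7 = 0xF9)
```

Note the displacement is signed, so the same 7-byte form reaches labels before the instruction as well, which is why it works regardless of where the linker places the code.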

I was under the impression that the Mach-O file format required PIC code but this may not be the case.


1 The Scale Index Base byte, used to encode the new addressing modes introduced with 32-bit mode.

Solution 2:

TL:DR: You can override this with

  • mov eax, 1 (explicitly use the optimal operand-size)
    b8 01 00 00 00
  • mov rax, strict dword 1 (sign-extended 32-bit immediate)
    48 c7 c0 01 00 00 00
  • mov rax, strict qword 1 (64-bit immediate like movabs in AT&T syntax)
    48 b8 01 00 00 00 00 00 00 00
    (Also mov rax, strict 1 is equivalent to this, and is what you get if you disable NASM optimization.)
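The three encodings above can be reproduced with a small Python sketch; the byte sequences come straight from the listing above, while the function and form names (mov_rax_imm, "sx_imm32", etc.) are purely illustrative.

```python
import struct

def mov_rax_imm(value: int, form: str) -> bytes:
    """Sketch of the three encodings NASM can emit for mov rax/eax, imm."""
    if form == "imm32":        # B8 imm32: mov eax, imm32, upper bits zeroed
        return bytes([0xB8]) + struct.pack("<I", value)
    if form == "sx_imm32":     # REX.W C7 /0: mov r/m64, sign-extended imm32
        return bytes([0x48, 0xC7, 0xC0]) + struct.pack("<i", value)
    if form == "imm64":        # REX.W B8: mov r64, imm64 (the 10-byte form)
        return bytes([0x48, 0xB8]) + struct.pack("<Q", value)
    raise ValueError(form)

for f in ("imm32", "sx_imm32", "imm64"):
    enc = mov_rax_imm(1, f)
    print(f"{f:9} {len(enc):2} bytes: {enc.hex(' ')}")
```

Running this prints the same 5-, 7-, and 10-byte sequences listed above, which makes the size cost of each strict override easy to compare.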

This is a perfectly safe and useful optimization, similar to using an 8-bit immediate instead of a 32-bit immediate when you write add eax, 1.

NASM only optimizes when the shorter form of the instruction has an identical architectural effect; mov eax,1 qualifies because it implicitly zeros the upper 32 bits of RAX. Note that add rax, 0 is different from add eax, 0, so NASM can't optimize that: only instructions like mov r32,... / mov r64,... or xor eax,eax that don't depend on the old value of the 32 vs. 64-bit register can be optimized this way.

You can disable it with nasm -O1 (the default is -Ox multipass), but note that you'll get 10-byte mov rax, strict qword 1 in that case: clearly NASM isn't intended to really be used with less than normal optimization. There isn't a setting where it will use the shortest encoding that wouldn't change the disassembly (e.g. 7-byte mov rax, sign_extended_imm32 = mov rax, strict dword 1).

The difference between -O0 and -O1 is in imm8 vs. imm32, e.g. add rax, 1 is
48 83 C0 01 (add r/m64, sign_extended_imm8) with -O1, vs.
48 05 01 00 00 00 (add rax, sign_extended_imm32) with nasm -O0.
Amusingly, even -O0 still optimizes by picking the special-case opcode that implies an RAX destination instead of taking a ModRM byte. Unfortunately -O1 doesn't optimize immediate sizes for mov (where sign_extended_imm8 isn't possible).
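Those two add encodings can also be sketched in Python, using the byte sequences quoted above; the function name and opt flag are illustrative, not NASM API.

```python
import struct

def add_rax_imm(value: int, opt: bool) -> bytes:
    """The two encodings quoted above for `add rax, imm`:
    with -O1, the imm8 form (REX.W 83 /0 ib) when the value fits;
    with -O0, the no-ModRM short form REX.W 05 id that implies RAX."""
    if opt and -128 <= value <= 127:
        return bytes([0x48, 0x83, 0xC0]) + struct.pack("<b", value)
    return bytes([0x48, 0x05]) + struct.pack("<i", value)

print(add_rax_imm(1, True).hex(" "))   # -> 48 83 c0 01
print(add_rax_imm(1, False).hex(" "))  # -> 48 05 01 00 00 00
```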

If you ever need a specific encoding somewhere, ask for it with strict instead of disabling optimization.


Note that YASM doesn't do this operand-size optimization, so it's a good idea to make the optimization yourself in the asm source, if you care about code-size (even indirectly for performance reasons) in code that could be assembled with other NASM-compatible assemblers.

For instructions where 32 and 64-bit operand size wouldn't be equivalent with very large (or negative) numbers, you need to use 32-bit operand-size explicitly, even when assembling with NASM rather than YASM, if you want the size / performance advantage. See The advantages of using 32bit registers/instructions in x86-64.


For 32-bit constants that don't have their high bit set, zero or sign extending them to 64 bits gives an identical result. Thus it's a pure optimization to assemble mov rax, 1 to a 5-byte mov r32, imm32 (with implicit zero extension to 64 bits) instead of a 7-byte mov r/m64, sign_extended_imm32.
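That equivalence is easy to verify directly. A small Python sketch of the two extension rules (helper names are mine, not anything from NASM):

```python
def zero_extend32to64(v: int) -> int:
    """What mov r32, imm32 does to the full 64-bit register."""
    return v & 0xFFFFFFFF

def sign_extend32to64(v: int) -> int:
    """What mov r/m64, sign_extended_imm32 does."""
    v &= 0xFFFFFFFF
    return v | (0xFFFFFFFF00000000 if v & 0x80000000 else 0)

# High bit clear: both extensions agree, so the 5-byte form is safe.
assert zero_extend32to64(1) == sign_extend32to64(1) == 1

# High bit set: they differ, so NASM must keep a 64-bit-aware encoding.
assert zero_extend32to64(0x80000000) == 0x0000000080000000
assert sign_extend32to64(0x80000000) == 0xFFFFFFFF80000000
```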

(See Difference between movq and movabsq in x86-64 for more details about the forms of mov x86-64 allows; AT&T syntax has a special name for the 10-byte immediate form but NASM doesn't.)

On all current x86 CPUs, the only performance difference between that and the 7-byte encoding is code-size, so only indirect effects like alignment and L1I$ pressure are a factor. Internally it's just a mov-immediate, so this optimization doesn't change the microarchitectural effect of your code either (except of course for code-size / alignment / how it packs in the uop cache).

The 10-byte mov r64, imm64 encoding is even worse for code size. If the constant actually has any of its high bits set, then it has extra inefficiency in the uop cache on Intel Sandybridge-family CPUs (using 2 entries in the uop cache, and maybe an extra cycle to read from the uop cache). But if the constant is in the -2^31 .. +2^31-1 range (signed 32-bit), it's stored internally just as efficiently, using only a single uop-cache entry, even if it was encoded in the x86 machine code using a 64-bit immediate. (See Agner Fog's microarch doc, Table 9.1. Size of different instructions in μop cache in the Sandybridge section.)

From How many ways to set a register to zero?, you can force any of the three encodings:

mov    eax, 1                ; 5 bytes to encode (B8 imm32)
mov    rax, strict dword 1   ; 7 bytes: REX mov r/m64, sign-extended-imm32.    NASM optimizes mov rax,1 to the 5B version, but dword or strict dword stops it for some reason
mov    rax, strict qword 1   ; 10 bytes to encode (REX B8 imm64).  movabs mnemonic for AT&T.  Normally assemblers choose smaller encodings if the operand fits, but strict qword forces the imm64.

Note that NASM uses the 10-byte encoding (which AT&T syntax calls movabs, and so does objdump in Intel-syntax mode) for an address which is a link-time constant but unknown at assemble time.

YASM chooses mov r64, imm32; i.e. it assumes a code model where label addresses fit in 32 bits, unless you use mov rsi, strict qword msg.

YASM's behaviour is normally good (although using mov r32, imm32 for static absolute addresses like C compilers do would be even better). The default non-PIC code-model puts all static code/data in the low 2GiB of virtual address space, so zero- or sign-extended 32-bit constants can hold addresses.

If you want 64-bit label addresses you should normally use lea r64, [rel address] to do a RIP-relative LEA. (On Linux at least, position-dependent code can go in the low 32, so unless you're using the large / huge code models, any time you need to care about 64-bit label addresses, you're also making PIC code where you should use RIP-relative LEA to avoid needing text relocations of absolute address constants).

i.e. gcc and other compilers would have used mov esi, msg, or lea rsi, [rel msg], never mov rsi, msg.
See How to load address of function or label into register