Why can I access lower dword/word/byte in a register but not higher?

I started to learn assembler, and this does not look logical to me.

Why can't I use multiple higher bytes in a register?

I understand the historical reason of rax->eax->ax, so let's focus on new 64-bit registers. For example, I can use r8 and r8d, but why not r8dl and r8dh? The same goes with r8w and r8b.

My initial thinking was that I can use 8 r8b registers at the same time (like I can do with al and ah at the same time). But I can't. And using r8b makes the complete r8 register "busy".

Which raises the question - why? Why would you need to use only a part of a register if you can't use other parts at the same time? Why not just keep only r8 and forget about the lower parts?


Solution 1:

why can't I use multiple higher bytes in a register

Every permutation of an instruction needs to be encoded in the instruction. The original 8086 processor supports the following options:

instruction     encoding    remarks
---------------------------------------------------------
mov ax,value    b8 01 00    <-- whole register
mov al,value    b0 01       <-- lower byte
mov ah,value    b4 01       <-- upper byte

Because the 8086 is a 16-bit processor, three different versions cover all options.
In the 80386, 32-bit support was added. The designers had a choice: either add support for 3 additional sets of registers (x 8 registers = 24 new registers) and somehow find encodings for these, or leave things mostly as they were before.

Here's what the designers opted for:

instruction      encoding          remarks
---------------------------------------------------------
mov eax,value    b8 01 00 00 00    (same opcode as mov ax,value had before!)
mov ax,value     66 b8 01 00       (prefix 66 + encoding for mov eax,value)
mov al,value     (same as before)
mov ah,value     (same as before)

They simply added a 0x66 prefix to change the operand size from the (now) default 32 bits to 16 bits, plus a 0x67 prefix to change the address size of memory operands. And left it at that.
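
To make the encoding scheme concrete, here is a minimal C sketch of how the `mov reg, imm` forms are put together in 32-bit mode. The function name `encode_mov_imm` and its parameters are made up for this illustration and do not come from any real assembler. It assumes the usual opcode bases (B0 + register number for byte moves, B8 + register number for word and dword moves) and the 0x66 operand-size prefix described above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (illustration only): encode "mov reg, imm" as it
 * would appear in 32-bit mode.
 *   width: 8, 16 or 32
 *   reg:   3-bit register number; 0..3 = a, c, d, b registers.
 *          For width 8, numbers 4..7 select ah, ch, dh, bh.
 * Writes at most 6 bytes to out and returns the count, or 0 on bad input. */
size_t encode_mov_imm(int width, unsigned reg, uint32_t imm, uint8_t *out)
{
    size_t n = 0;

    if (reg > 7 || (width != 8 && width != 16 && width != 32))
        return 0;

    if (width == 8) {
        out[n++] = (uint8_t)(0xB0 + reg);   /* byte form: B0+r, imm8 */
        out[n++] = (uint8_t)imm;
        return n;
    }

    if (width == 16)
        out[n++] = 0x66;                    /* operand-size prefix */

    out[n++] = (uint8_t)(0xB8 + reg);       /* word/dword form: B8+r */
    for (int i = 0; i < width / 8; i++)     /* little-endian immediate */
        out[n++] = (uint8_t)(imm >> (8 * i));
    return n;
}
```

Note that the 16-bit and 32-bit forms share one opcode and differ only by the prefix, while ah needs a register number of its own inside the byte form. That is the encoding cost a hypothetical r8dh would have to pay somewhere.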

To do otherwise would have meant doubling the number of instruction encodings or adding several new prefixes for your 'new' partial registers.
By the time the 80386 came out, practically all single-byte opcodes were already taken, so there was very little space for new prefixes. This opcode space had been eaten up by useless instructions like AAA, AAD, AAM, AAS, DAA, DAS and SALC. (These have been removed in 64-bit mode to free up much-needed encoding space.)

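If you want to put a byte into the upper part of a register, simply do:

movzx eax,cl     ;like mov al,cl, but zero-extends into eax (no partial-register write)
shl eax,24       ;move that byte to the top byte of eax

Note that this overwrites all of eax: the lower 24 bits end up zero. To keep the other bytes, combine it with masking.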

But why not two (say r8dl and r8dh)

In the original 8086 there were 8 byte sized registers:

al,cl,dl,bl,ah,ch,dh,bh  <-- in this order.

The index registers, base pointer and stack reg do not have byte registers.

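In x86-64 this was changed. If the instruction has a REX prefix (which is needed to reach the new x86-64 registers), the 3-bit register field plus 1 extra encoding bit from the REX prefix selects one of 16 byte registers: al, cl, dl, bl, spl, bpl, sil, dil and r8b..r15b (also written r8l..r15l). This adds spl, bpl, sil and dil, but excludes every xh register. (You can still use the four xh registers in instructions that have no REX prefix.)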
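
The register selection described above can be sketched as a small lookup in C. The function name `byte_reg_name` and its parameters are invented for this example. It only models which name a given register field refers to, not a full instruction decoder.

```c
#include <stddef.h>

/* Hypothetical helper (illustration only): name the byte register chosen
 * by a 3-bit register field.
 *   has_rex: nonzero if the instruction carries a REX prefix
 *   ext_bit: the extra register bit supplied by REX (0 or 1)
 *   field:   the 3-bit register number from the instruction (0..7)
 * Returns NULL for combinations that cannot be encoded. */
const char *byte_reg_name(int has_rex, unsigned ext_bit, unsigned field)
{
    static const char *const legacy[8] = {
        "al", "cl", "dl", "bl", "ah", "ch", "dh", "bh"
    };
    static const char *const with_rex[16] = {
        "al",  "cl",  "dl",   "bl",   "spl",  "bpl",  "sil",  "dil",
        "r8b", "r9b", "r10b", "r11b", "r12b", "r13b", "r14b", "r15b"
    };

    if (field > 7 || ext_bit > 1)
        return NULL;
    if (!has_rex)                       /* no REX: no extension bit exists */
        return ext_bit ? NULL : legacy[field];
    return with_rex[ext_bit * 8 + field];
}
```

Field values 4 to 7 are the interesting ones: the same bits mean ah, ch, dh, bh without REX and spl, bpl, sil, dil with it, so there is no slot left over for anything like r8dh.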

And using r8b makes the complete r8 "busy"

Yes, this is called a 'partial register write'. Because writing r8b changes part, but not all of r8, r8 is now split into two halves. One half has changed and one half has not. The CPU needs to join the two halves. It can either do this by using an extra CPU cycle to perform the work, or by adding more circuitry to the task to be able to do it in a single cycle.
The latter is expensive in terms of silicon and complex in terms of design, it also adds extra heat because of the extra work being done (more work per cycle = more heat produced). See Why doesn't GCC use partial registers? for a run-down on how different x86 CPUs handle partial-register writes (and later reads of the full register).

if I use r8b I can't access upper 56 bits at the same time, they exist, but unaccessible

No, they are not inaccessible.

mov  rax,bignumber         ;arbitrary value in rax
mov  al,0                  ;clear al
xor  r8d,r8d               ;r8=0
mov  r8b,16                ;set r8b
or   r8,rax                ;change the upper bytes of r8 without changing r8b

You use masks together with and, or, xor and not to change parts of a register without affecting the rest of it.
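
As a rough model of the masking approach, here is a short C sketch that treats a `uint64_t` as the register. The helper names `set_byte` and `get_byte` are assumptions made for this example. Each maps to a handful of and/or/shift instructions on a real CPU.

```c
#include <stdint.h>

/* Hypothetical helper: return reg with byte number idx replaced by value.
 * idx 0 is the lowest byte (like r8b), idx 7 the highest.
 * All other bytes of reg are preserved. */
uint64_t set_byte(uint64_t reg, unsigned idx, uint8_t value)
{
    unsigned shift = 8 * (idx & 7);
    uint64_t mask = (uint64_t)0xFF << shift;       /* selects the target byte */
    return (reg & ~mask) | ((uint64_t)value << shift);
}

/* Hypothetical helper: read byte number idx of reg. */
uint8_t get_byte(uint64_t reg, unsigned idx)
{
    return (uint8_t)(reg >> (8 * (idx & 7)));
}
```

With these, any of the eight bytes of a 64-bit register can be read or replaced, even though only the lowest one has a register name of its own.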

There really was never a need for ah, but it did lead to more compact code on 8086 (and effectively more usable registers). It's still sometimes useful to write EAX or RAX and then read AL and AH separately (e.g. movzx ecx, al / movzx edx, ah) as part of unpacking bytes.

Solution 2:

The general answer is that such access is costly in a few senses and rarely needed.

Since at least the second half of the 1980s, and even more so since the 1990s, instruction sets have been designed mainly for compiler convenience rather than human convenience. Compiler logic is much simpler when it maps a set of variables with defined sizes (8, 16, 32, 64 bits) onto a fixed set of registers, with each register holding exactly one value at a time. Overlapping registers complicate this considerably. As a result, a compiler internally knows a single register "A" (or even R0) that appears as AL, AX, EAX or RAX depending on operand size. To use AH, it would have to take into account that AX consists of AH and AL, which falls outside that model. Even if it generates an instruction involving AH (e.g. LAHF), internally this is likely treated as "an operation that fills A with LowFlags*256". (In reality there are some hacks that blur this clean picture, but they are very local.)

This combines with other compiler specifics. For example, GCC and Clang are heavily SSA-based. As a result, you will practically never see a register-to-register XCHG in their output (XCHG with a memory operand is implicitly locked and can appear for atomic operations, which is a different matter); if you find one in code, it is almost certainly hand-written assembly. The same goes for RCL and RCR, even though they are suitable in some specific cases (e.g. dividing a uint32 by 7), and probably also for ROL and ROR outside of recognized rotate idioms. If AMD had dropped RCL and RCR from their x86-64 design, nobody would really have mourned these instructions.
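
To show why a rotate-through-carry could fit the divide-by-7 case, here is a C sketch of the multiply-high technique. The function names are invented for this example, and the constant 0x24924925 is the standard magic number for unsigned 32-bit division by 7 (the low 32 bits of ceil(2^35 / 7)). The first version needs a 33-bit intermediate sum, which in assembly would be the carry flag plus a 32-bit register (the spot where RCR could be used). The second is a carry-free rearrangement of the kind compilers tend to prefer. No claim is made about the exact sequence any particular compiler version emits.

```c
#include <stdint.h>

/* n / 7 using a 33-bit intermediate: q + n may overflow 32 bits,
 * so the sum is held in 64 bits here (carry flag + register in asm). */
uint32_t div7_wide(uint32_t n)
{
    uint32_t q = (uint32_t)(((uint64_t)n * 0x24924925u) >> 32); /* high half */
    uint64_t sum = (uint64_t)q + n;                             /* up to 33 bits */
    return (uint32_t)(sum >> 3);
}

/* Same quotient using only 32-bit intermediates after the multiply:
 * ((n - q) >> 1) + q equals (n + q) >> 1 and cannot overflow, since q <= n. */
uint32_t div7_narrow(uint32_t n)
{
    uint32_t q = (uint32_t)(((uint64_t)n * 0x24924925u) >> 32);
    uint32_t t = ((n - q) >> 1) + q;
    return t >> 2;
}
```

Both functions return `n / 7` for every 32-bit `n`. The second one trades the carry handling for one extra subtraction, which keeps every value inside an ordinary register.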

This does not include the vector facility, which is modelled on different principles and is orthogonal to the main register set. When a compiler decides to do 4 parallel uint32 operations on an XMM register, it can use PINSR* instructions to replace part of such a register or PEXTR* to extract it, but in that case it is tracking 2, 4, 8, 16... values at once. Such vectorization doesn't apply to the main register set, at least in the mainstream state-of-the-art ISAs.

This trend in compilers has been matched by an ongoing and strengthening trend in hardware. It's easier to provide 16-32 independent architectural registers and track them individually (see register renaming), e.g. an add with 2 register sources and 1 register result, than to track each part of a register separately and treat the same instruction as one that takes 16 single-byte sources and generates 8 single-byte results. (That's why x86-64 is designed so that a 32-bit register write clears the upper 32 bits of the 64-bit register; this is not done for 8- and 16-bit operations because, for legacy reasons, the CPU already has to merge the result with the upper bits of the previous register value.)
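
The difference between the write widths can be modelled in a few lines of C, with a `uint64_t` standing in for the architectural register. The function names are made up for this sketch. The point to notice is which functions actually need the old register value as an input.

```c
#include <stdint.h>

/* Model of a 32-bit register write (e.g. mov eax, v):
 * the result is zero-extended, so the old value is not needed at all. */
uint64_t write32(uint64_t old, uint32_t v)
{
    (void)old;                      /* no dependency on the previous value */
    return v;
}

/* Model of a 16-bit write (e.g. mov ax, v): upper 48 bits are merged in. */
uint64_t write16(uint64_t old, uint16_t v)
{
    return (old & ~0xFFFFULL) | v;
}

/* Model of an 8-bit write (e.g. mov al, v): upper 56 bits are merged in. */
uint64_t write8(uint64_t old, uint8_t v)
{
    return (old & ~0xFFULL) | v;
}
```

In this model `write8` and `write16` have two inputs, the new value and the old register contents, and that extra input is the merge the hardware has to perform. `write32` has only one, which is what makes it cheap to rename.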

There is some chance of this changing in the future, before some radical revolution in CPU design, but I consider it really minimal.

If you need access to part of a register now, e.g. bits 40-47 of RAX, this can be implemented quite easily with copies, shifts and rotations. To extract it:

MOV RCX, RAX ; expect result in CL
SHR RCX, 40
MOVZX RCX, CL ; to clear all bits except 7-0

To replace value:

ROR RAX, 40
MOV AL, CL ; provided that CL is what to insert
ROL RAX, 40

These code chunks are straight-line (no branches) and fast enough.
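
For readers who want to check the two sequences, here is a C model that follows them step by step, with a `uint64_t` playing the role of RAX. The function names and the rotate helpers are assumptions made for this example; each statement is annotated with the instruction it stands for.

```c
#include <stdint.h>

/* Rotate helpers; n is reduced mod 64 and n == 0 is handled separately
 * to avoid an undefined 64-bit shift in C. */
static uint64_t ror64(uint64_t x, unsigned n)
{
    n &= 63;
    return n ? (x >> n) | (x << (64 - n)) : x;
}

static uint64_t rol64(uint64_t x, unsigned n)
{
    n &= 63;
    return n ? (x << n) | (x >> (64 - n)) : x;
}

/* Extract bits 40-47 of rax. */
uint8_t extract_bits_40_47(uint64_t rax)
{
    uint64_t rcx = rax;             /* mov rcx, rax  */
    rcx >>= 40;                     /* shr rcx, 40   */
    return (uint8_t)rcx;            /* movzx rcx, cl */
}

/* Replace bits 40-47 of rax with cl, keeping all other bits. */
uint64_t replace_bits_40_47(uint64_t rax, uint8_t cl)
{
    rax = ror64(rax, 40);           /* ror rax, 40: target byte is now AL */
    rax = (rax & ~0xFFULL) | cl;    /* mov al, cl                         */
    rax = rol64(rax, 40);           /* rol rax, 40: rotate it back        */
    return rax;
}
```

The rotate version works because a rotation loses no bits: the byte of interest is moved into the one position that has a name (AL), modified there, and moved back.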

Solution 3:

There is another step in the history, the 8-bit 8080 that came before the 8086. Despite it being an 8-bit processor, you could use pairs of 8-bit registers to perform some 16-bit operations.

https://en.wikipedia.org/wiki/Intel_8080#Registers

So to make it easier to convert 8080 assembly code to 8086 code - which seemed important at the time (Intel even supplied a program to do that automatically, almost) - the new 16-bit registers were designed to optionally be used as pairs of 8-bit registers.

However, in the 8086 there were no features to use pairs of 16-bit registers for 32-bit operations, so when the 386 came around there didn't seem to be a need for splitting 32-bit registers into two 16-bit registers.

As Johan shows, the instruction set still provides a way to get two 8-bit registers from the lowest 16 bits. But this (mis)feature was not extended to higher widths.

Likewise, when moving to 64 bits there is no precedent of using pairs of 32-bit registers for 64-bit operations (except for some odd double shifts). And nobody tries to convert old assembly code anymore. Never worked that well anyway.