x86 registers: MBR/MDR and instruction registers

From what I have read, the IA-32 architecture has ten 32-bit and six 16-bit registers.

The 32-bit registers are as follows:

  • Data registers - EAX, EBX, ECX, EDX
  • Pointer registers - EIP, ESP, EBP
  • Index registers - ESI, EDI
  • Control registers - EFLAGS (EIP is also classified as a control register)

The 16-bit registers are as below:

  • Code Segment (CS): contains all the instructions to be executed.
  • Data Segment (DS): contains data, constants and work areas.
  • Stack Segment (SS): contains data and return addresses of procedures or subroutines.
  • Extra Segment (ES): pointer to extra data.
  • F Segment (FS): pointer to more extra data.
  • G Segment (GS): pointer to still more extra data.

However, I can't find any information on the Current Instruction Register (CIR) or the Memory Buffer Register (MBR) / Memory Data Register (MDR). Are these registers referred to as something else? And are these registers 32-bit?

I assume they are 32-bit and that most commonly used instructions under this architecture are under 4 bytes long. From observation, many instructions seem to be under 4 bytes, for example:

  • PUSH EBP (55)
  • MOV EBP, ESP (8B EC)
  • LEA (8D 44 38 02)

For longer instructions, the CPU will use prefix codes and other optional codes. Longer instructions will require more than one cycle to complete, with the cycle count depending on instruction length.

Am I correct in that the registers in question are 32-bit in length? And are there any other registers in the IA-32 architecture that I should also be aware of?


Solution 1:

No, the registers you're talking about are implementation details that don't exist as physical registers in modern x86 CPUs.

x86 doesn't specify any of those implementation details you find in toy / teaching CPU designs. The x86 manuals only specify things that are architecturally visible.

Intel and AMD's optimization manuals go into some detail about the internal implementation, and it's nothing like what you're suggesting. Modern x86 CPUs rename the architectural registers onto much larger physical register files, enabling out-of-order execution without stalling for write-after-write or write-after-read data hazards. (See "Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables?" for more details about register renaming.) See this answer for a basic intro to out-of-order exec, and a block diagram of an actual Haswell core. (And remember that a physical chip has multiple cores.)
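To make the renaming idea concrete, here is a toy sketch in Python. It is not how any real Intel or AMD core is built: the `RenameTable` class, the four-register set, and the three-instruction program are all hypothetical, and it ignores the finite size of a real physical register file (there is no free list or reclamation). It only shows why giving every write a fresh physical register removes write-after-write and write-after-read hazards while keeping true dependencies.

```python
from itertools import count

ARCH_REGS = ["eax", "ebx", "ecx", "edx"]  # small illustrative subset


class RenameTable:
    """Toy register-alias table: architectural name -> physical register id."""

    def __init__(self, arch_regs):
        self._next_phys = count()
        # Each architectural register starts mapped to its own physical register.
        self.mapping = {r: next(self._next_phys) for r in arch_regs}

    def rename(self, dst, srcs):
        """Rename one instruction: look up sources first, then allocate a new dst."""
        phys_srcs = [self.mapping[s] for s in srcs]
        phys_dst = next(self._next_phys)  # assumption: unlimited physical registers
        self.mapping[dst] = phys_dst
        return phys_dst, phys_srcs


rat = RenameTable(ARCH_REGS)

# Hypothetical program as (destination, sources) pairs.
program = [
    ("eax", ["ebx"]),  # mov eax, ebx
    ("ecx", ["eax"]),  # mov ecx, eax  (true dependency on the first write)
    ("eax", ["edx"]),  # mov eax, edx  (WAW with the 1st, WAR with the 2nd)
]
renamed = [rat.rename(dst, srcs) for dst, srcs in program]

for (dst, srcs), (pd, ps) in zip(program, renamed):
    print(f"{dst} <- {srcs}  becomes  p{pd} <- {['p%d' % s for s in ps]}")
```

After renaming, the two writes to `eax` target different physical registers, so the third instruction does not have to wait for the second one to read the old `eax`. The second instruction still reads the physical register written by the first, so the real data dependency is preserved.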


Unlike a simple or toy microarchitecture, almost all high-performance CPUs support miss-under-miss and/or hit-under-miss: multiple cache misses can be outstanding at once, and memory operations are not totally blocked waiting for the first miss to complete.


You could build a simple x86 that had a single MBR / MDR; I wouldn't be surprised if original 8086 and maybe 386 microarchitectures had something like that as part of the internal implementation.

But, for example, a Haswell or Skylake core can do 2 loads and 1 store per cycle from/to L1d cache (see "How can cache be that fast?"). Obviously they can't have just one MBR. Instead, Haswell has 72 load-buffer entries and 42 store-buffer entries, which together are part of the Memory Order Buffer. The MOB supports out-of-order execution of loads / stores while maintaining the illusion that only StoreLoad reordering happens / is visible to other cores.
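As a rough illustration of what "StoreLoad reordering is visible to other cores" means, here is a toy model of per-core store buffers in Python. The `Core` class and the two-variable litmus test are hypothetical and vastly simplified (no caches, no coherence protocol, no load buffer); the only assumption it encodes is that a store sits in a private FIFO before it becomes globally visible, and that a core's own loads can be forwarded from that FIFO.

```python
from collections import deque


class Core:
    """Toy core with a private FIFO store buffer in front of shared memory."""

    def __init__(self, memory):
        self.memory = memory         # shared dict: address -> value
        self.store_buffer = deque()  # pending (address, value), in program order

    def store(self, addr, value):
        # The store is buffered; other cores cannot see it yet.
        self.store_buffer.append((addr, value))

    def load(self, addr):
        # Store forwarding: the youngest buffered store to this address wins.
        for a, v in reversed(self.store_buffer):
            if a == addr:
                return v
        return self.memory[addr]

    def drain(self):
        # Commit buffered stores to shared memory in program order.
        while self.store_buffer:
            addr, value = self.store_buffer.popleft()
            self.memory[addr] = value


memory = {"x": 0, "y": 0}
core0, core1 = Core(memory), Core(memory)

# Classic store-buffer litmus test: each core stores one variable,
# then loads the other one before either store has drained.
core0.store("x", 1)
core1.store("y", 1)
r0 = core0.load("y")   # core1's store is still buffered
r1 = core1.load("x")   # core0's store is still buffered
own = core0.load("x")  # forwarded from core0's own store buffer

core0.drain()
core1.drain()
print(r0, r1, own, memory)
```

In this model both `r0` and `r1` come back as 0, which looks as if each load executed before the earlier store in program order. Each core still sees its own store immediately through forwarding, and both stores reach memory once the buffers drain.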

Since P5 Pentium, naturally-aligned loads/stores up to 64 bits are guaranteed atomic, but before that only 32-bit accesses were atomic. So yes, if 386/486 had an MDR, it could have been 32 bits. But even those early CPUs could have cache between the CPU and RAM.

We know that Haswell and later have a 256-bit path between L1d cache and execution units, i.e. 32 bytes, and Skylake-AVX512 has 64-byte paths for ZMM loads/stores. AMD CPUs split wide vector ops into 128-bit chunks, so their load/store buffer entries are presumably only 16 bytes wide.

Intel CPUs at least merge adjacent stores to the same cache line within the store buffer, and there are also the 10 LFBs (line-fill buffers) for pending transfers between L1d and L2 (or off-core to L3 or DRAM).


Instruction decoding: x86 is variable-length

x86 is a variable-length instruction set; even without counting prefixes, the longest instruction is longer than 32 bits. This was true even for 8086. For example, add word [bx+disp16], imm16 is 6 bytes long. But 8088 only had a 4-byte prefetch queue to decode from (vs. 8086's 6-byte queue), so it had to support decoding instructions without having loaded the whole thing from memory. 8088 / 8086 decoded prefixes 1 cycle at a time, and 4 bytes of opcode + modRM is definitely enough to identify the length of the rest of the instruction, so it could decode it and then fetch the disp16 and/or imm16 if they weren't fetched yet. Modern x86 can have much longer instructions, especially with SSSE3 / SSE4 requiring many mandatory prefixes as part of the opcode.
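Here is a small Python sketch of the point that the opcode and ModRM bytes are enough to know the total length. It is only an illustration under stated assumptions: the `OPCODES` table is a hypothetical four-entry subset, it assumes 16-bit operand and address size as on 8086, and it ignores prefixes entirely. It says nothing about how the real 8086 decoder was wired.

```python
# Hypothetical, tiny opcode table: opcode -> (has_modrm, immediate_bytes).
# Immediate sizes assume 16-bit operand size, as on 8086.
OPCODES = {
    0x55: (False, 0),  # push bp
    0x8B: (True, 0),   # mov r16, r/m16
    0x81: (True, 2),   # group-1 ALU op (add, or, ..., cmp) r/m16, imm16
    0x83: (True, 1),   # group-1 ALU op r/m16, sign-extended imm8
}


def disp_bytes_16(modrm):
    """Displacement size implied by a ModRM byte under 16-bit addressing."""
    mod = modrm >> 6
    rm = modrm & 0b111
    if mod == 0b00:
        return 2 if rm == 0b110 else 0  # mod=00, rm=110 is a bare [disp16]
    if mod == 0b01:
        return 1                        # [reg + disp8]
    if mod == 0b10:
        return 2                        # [reg + disp16]
    return 0                            # mod=11: register operand


def insn_length(code):
    """Total length in bytes, using only the opcode and (if present) ModRM."""
    has_modrm, imm = OPCODES[code[0]]
    length = 1
    if has_modrm:
        length += 1 + disp_bytes_16(code[1])
    return length + imm


# add word [bx+0x1234], 0x5678  ->  81 87 34 12 78 56
add_insn = bytes([0x81, 0x87, 0x34, 0x12, 0x78, 0x56])
print(insn_length(add_insn))        # 6
print(insn_length(add_insn[:2]))    # 6, the first two bytes already determine it
print(insn_length(bytes([0x8B, 0xEC])))  # 2 (mov bp, sp)
print(insn_length(bytes([0x55])))        # 1 (push bp)
```

The second call is the interesting one: with only `81 87` available, the length is already known, so a decoder with a short prefetch queue can start work and fetch the remaining displacement and immediate bytes afterwards.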

It's also a CISC ISA, so keeping around the actual instruction bytes internally isn't very useful; you can't use the instruction bits directly as internal control signals the way you can with a simple MIPS.

In a non-pipelined CPU, yes there might be a single physical EIP register somewhere. For modern CPUs, each instruction has an EIP associated with it, but many are in flight at once inside the CPU. An in-order pipelined CPU might associate an EIP with each stage, but an out-of-order CPU would have to track it on a per-instruction basis. (Actually per uop, because complex instructions decode to more than 1 internal uop.)

Modern x86 fetches and decodes in blocks of 16 or 32 bytes, decoding up to 5 or 6 instructions per clock cycle and placing the decode results in a queue for the front-end to issue into the out-of-order part of the core.

See also the CPU-internals links in https://stackoverflow.com/tags/x86/info, especially David Kanter's write-ups and Agner Fog's microarch guides.


BTW, you left out x86's many control / debug registers. CR0..4 are critical for 386 to enable protected mode, paging, and various other stuff. You could run a CPU in real mode using only the GP and segment regs plus EFLAGS, but x86 has far more architectural registers if you include the non-general-purpose regs that the OS needs to manage.