Are loads and stores the only instructions that get reordered?

Solution 1:

Out-of-order execution preserves the illusion of running in program order for a single thread/core. This is like the C/C++ as-if optimization rule: do whatever you want internally, as long as the visible effects are the same.

Separate threads can only communicate with each other via memory, so the global order of memory operations (loads/stores) is the only externally visible side-effect of execution¹.

Even in-order CPUs can make their memory operations become globally visible out of order. (e.g. even a simple RISC pipeline with a store buffer will have StoreLoad reordering, like x86). A CPU that starts loads/stores in order but allows them to complete out of order (to hide cache-miss latency) could also reorder loads unless it specifically avoids it (or, like modern x86, executes aggressively out of order but pretends it doesn't by tracking memory ordering carefully).


A simple example: two ALU dependency chains can overlap

(related: http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ for more about how big the window is for finding instruction-level parallelism, e.g. if you increased times 25 to times 200, you'd see only limited overlap. Also related: this beginner to intermediate-level answer I wrote about how an OoO CPU like Haswell or Skylake finds and exploits ILP.)

See also Modern Microprocessors A 90-Minute Guide! for an excellent intro to superscalar and out-of-order exec CPUs.

For a much deeper analysis of the impact of lfence here, see Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths

global _start
_start:
    mov  ecx, 10000000
.loop:
    times 25 imul eax,eax   ; expands to imul eax,eax  / imul eax,eax / ...
 ;   lfence
    times 25 imul edx,edx
 ;   lfence
    dec  ecx
    jnz  .loop

    xor  edi,edi
    mov  eax,231
    syscall          ; sys_exit_group(0)

Built (with nasm + ld) into a static executable on x86-64 Linux, this runs (on Skylake) in the expected 750M clock cycles for each chain of 25 * 10M imul instructions times 3 cycle latency.

Commenting out one of the imul chains doesn't change the time it takes to run: still 750M cycles.

This is definitive proof of out-of-order execution interleaving the two dependency chains; otherwise, running them back-to-back would take the sum of their times. (imul throughput is 1 per clock, latency 3 clocks: http://agner.org/optimize/. So a third dependency chain could be mixed in without much slowdown.)

Actual numbers from taskset -c 3 ocperf.py stat --no-big-num -etask-clock,context-switches,cpu-migrations,page-faults,cycles:u,branches:u,instructions:u,uops_issued.any:u,uops_executed.thread:u,uops_retired.retire_slots:u -r3 ./imul:

  • with both imul chains: 750566384 +- 0.1%
  • with only the EAX chain: 750704275 +- 0.0%
  • with one times 50 imul eax,eax chain: 1501010762 +- 0.0% (almost exactly twice as slow, as expected).
  • with lfence preventing overlap between each block of 25 imul: 1688869394 +- 0.0%, worse than twice as slow. uops_issued_any and uops_retired_retire_slots are both 63M, up from 51M, while uops_executed_thread is still 51M (lfence doesn't use any execution ports, but apparently two lfence instructions cost 6 fused-domain uops each. Agner Fog only measured 2.)

(lfence serializes instruction execution, but not memory stores. If you aren't using NT loads from WC memory (which won't happen by accident), it's a no-op other than stopping later instructions from executing until previous instructions have "completed locally", i.e. until they've retired from the out-of-order core. This is probably why it more than doubles the total time: it has to wait for the last imul in a block to go through more pipeline stages.)

lfence on Intel is always like that, but on AMD it's only partially-serializing with Spectre mitigation enabled.


Footnote 1: There are also timing side-channels when two logical threads share one physical core (hyperthreading or other SMT). e.g. executing a sequence of independent imul instructions will run at 1 per clock on a recent Intel CPU if the other hyperthread doesn't need port 1 for anything. So you can measure how much port 1 pressure there is by timing an ALU-bound loop on one logical core.

Other micro-architectural side-channels, such as cache accesses, are more reliable. For example, Spectre / Meltdown are easiest to exploit with a cache-read side-channel, rather than ALU.

But all of these side-channels are finicky and unreliable compared to architecturally-supported reads/writes to shared memory, so they're only relevant for security. They're not used intentionally within the same program for communicating between threads.


MFENCE on Skylake is an OoO exec barrier like LFENCE

mfence on Skylake unexpectedly blocks out-of-order execution of imul, like lfence, even though it's not documented to have that effect. (See the moved-to-chat discussion for more).

xchg [rdi], ebx (implicit lock prefix) doesn't block out-of-order execution of ALU instructions at all. The total time is still 750M cycles when replacing lfence with xchg or a locked instruction in the above test.

But with mfence, the cost goes up to 1500M cycles + the time for 2 mfence instructions. To do a controlled experiment, I kept the instruction-count the same but moved the mfence instructions next to each other, so the imul chains could reorder with each other, and the time went down to 750M + the time for 2 mfence instructions.

This Skylake behaviour is very likely the result of a microcode update to fix erratum SKL079, MOVNTDQA From WC Memory May Pass Earlier MFENCE Instructions. The existence of the erratum shows that it used to be possible to execute later instructions before mfence completed, so probably they did a brute-force fix of adding lfence uops to the microcode for mfence.

This is another factor in favour of using xchg for seq-cst stores, or even lock add to some stack memory as a stand-alone barrier. Linux already does both of those things, but compilers still use mfence for barriers. See Why does a std::atomic store with sequential consistency use XCHG?

(See also discussion about Linux's barrier choices on this Google Groups thread, with links to 3 separate recommendations for using lock addl $0, -4(%esp/rsp) instead of mfence as a stand-alone barrier.)

Solution 2:

Out-of-order processors can generally reorder all instructions where doing so is possible, feasible, and beneficial for performance. Thanks to register renaming, this is transparent to the machine code except in the case of loads and stores. That's why people usually only talk about load and store reordering: it's the only observable kind of reordering.


Typically, FPU exceptions are also something where you can observe reordering. Most out-of-order processors have imprecise exceptions for this reason, but x86 does not: on x86, the processor makes sure that exceptions are reported as if the floating-point operations had not been reordered.