Is vxorps-zeroing on AMD Jaguar/Bulldozer/Zen faster with xmm registers than ymm?

AMD CPUs handle 256b AVX instructions by decoding into two 128b operations. e.g. vaddps ymm0, ymm1,ymm1 on AMD Steamroller decodes to 2 macro-ops, with half the throughput of vaddps xmm0, xmm1,xmm1.

XOR-zeroing is a special case: it has no input dependency, and on Jaguar at least it avoids consuming a physical register file entry and enables movdqa from that register to be eliminated at issue/rename (which Bulldozer does all the time, even for non-zeroed regs). But is it detected early enough that vxorps ymm0,ymm0,ymm0 still decodes to only 1 macro-op, with performance equal to vxorps xmm0,xmm0,xmm0 (unlike vxorps ymm3,ymm2,ymm1)?

Or does independence-detection happen later, after the instruction has already been decoded into two uops? Also, does vector xor-zeroing on AMD CPUs still use an execution port? On Intel CPUs, Nehalem needs a port but the Sandybridge-family handles it in the issue/rename stage.

Agner Fog's instruction tables don't list this special case, and his microarch guide doesn't mention the number of uops.


This could mean vxorps xmm0,xmm0,xmm0 is a better way to implement _mm256_setzero_ps().

For AVX512, _mm512_setzero_ps() also saves a byte or two by using only a VEX-coded zeroing idiom, rather than EVEX, when possible (i.e. for zmm0-15; vxorps xmm31,xmm31,xmm31 would still require an EVEX encoding). gcc/clang currently use xor-zeroing idioms of whatever register width they want, rather than always using AVX-128.

Reported as clang bug 32862 and gcc bug 80636. MSVC already uses xmm. Not yet reported to ICC, which also uses zmm regs for AVX512 zeroing. (Although Intel might not care to change, since there's currently no benefit on any Intel CPUs, only AMD. If they ever release a low-power CPU that splits vectors in half, they might. Their current low-power design (Silvermont) doesn't support AVX at all, only SSE4.)
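
For concreteness, a minimal C reproducer (my sketch, not taken from the bug reports): compiled with -mavx, the compilers mentioned above emit a ymm-width zeroing idiom here, although an xmm-width one would zero the same register.

#include <immintrin.h>

// gcc/clang currently emit  vxorps ymm0, ymm0, ymm0  for this function.
// vxorps xmm0, xmm0, xmm0 would also produce an all-zero ymm0, because
// VEX-encoded 128-bit instructions zero the upper lanes of the destination.
__m256 zero256(void) {
    return _mm256_setzero_ps();
}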


The only possible downside I know of to using an AVX-128 instruction for zeroing a 256b register is that it doesn't trigger warm-up of the 256b execution units on Intel CPUs, possibly defeating a C or C++ hack that tries to warm them up.

(256b vector instructions are slower for the first ~56k cycles after the first 256b instruction; see the Skylake section in Agner Fog's microarch pdf.) It's probably ok if calling a noinline function that returns _mm256_setzero_ps() isn't a reliable way to warm up the execution units. (One that still works without AVX2, and avoids any loads that could cache-miss, is __m128 onebits = _mm_castsi128_ps(_mm_set1_epi8(0xff)); return _mm256_insertf128_ps(_mm256_castps128_ps256(onebits), onebits, 1);, which should compile to vpcmpeqd xmm0,xmm0,xmm0 / vinsertf128 ymm0,ymm0,xmm0,1. That's still pretty trivial for something you call once to warm up (or keep warm) the execution units well ahead of a critical loop. And if you want something that can inline, you probably need inline asm.)
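
As a compilable sketch of that warm-up helper (the function name and the GNU-style noinline attribute are mine):

#include <immintrin.h>

// Returns a __m256 of all-ones without any loads, using only AVX1.
// noinline so that calling it is a self-contained way to touch the 256b
// execution units; it should compile to roughly
//   vpcmpeqd xmm0, xmm0, xmm0
//   vinsertf128 ymm0, ymm0, xmm0, 1
__attribute__((noinline))
__m256 warm_up_ymm(void)
{
    __m128 onebits = _mm_castsi128_ps(_mm_set1_epi8(0xff));
    return _mm256_insertf128_ps(_mm256_castps128_ps256(onebits), onebits, 1);
}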


I don't have AMD hardware so I can't test this.

If anyone has AMD hardware but doesn't know how to test, use perf counters to count cycles (and preferably m-ops or uops or whatever AMD calls them).

This is the NASM/YASM source I use to test short sequences:

section .text
global _start
_start:

    mov     ecx, 250000000

align 32  ; shouldn't matter, but just in case
.loop:

    dec     ecx  ; prevent macro-fusion by separating this from jnz, to avoid differences on CPUs that can't macro-fuse

%rep 6
    ;    vxorps  xmm1, xmm1, xmm1
    vxorps  ymm1, ymm1, ymm1
%endrep

    jnz .loop

    xor edi,edi
    mov eax,231    ; exit_group(0) on x86-64 Linux
    syscall

If you're not on Linux, maybe replace the stuff after the loop (the exit syscall) with a ret, and call the function from a C main() function.
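
For example, a minimal C driver (the function name here is made up; rename _start to it, export it with global, and end the body with ret):

// xor_zero_loop is a hypothetical name for the assembly loop, declared
// global in the .asm file and linked together with this C file.
extern void xor_zero_loop(void);

int main(void)
{
    xor_zero_loop();
    return 0;
}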

Assemble with nasm -felf64 vxor-zero.asm && ld -o vxor-zero vxor-zero.o to make a static binary. (Or use the asm-link script I posted in a Q&A about assembling static/dynamic binaries with/without libc).

Example output on an i7-6700k (Intel Skylake), at 3.9GHz. (IDK why my machine only goes up to 3.9GHz after it's been idle a few minutes; turbo up to 4.2 or 4.4GHz works normally right after boot.) Since I'm measuring with perf counters, it doesn't actually matter what clock speed the machine runs at. No loads/stores or code-cache misses are involved, so the core clock cycle counts are constant regardless of how long each run takes in wall-clock time.

$ alias disas='objdump -drwC -Mintel'
$ b=vxor-zero;  asm-link "$b.asm" && disas "$b" && ocperf.py stat -etask-clock,cycles,instructions,branches,uops_issued.any,uops_retired.retire_slots,uops_executed.thread -r4 "./$b"
+ yasm -felf64 -Worphan-labels -gdwarf2 vxor-zero.asm
+ ld -o vxor-zero vxor-zero.o

vxor-zero:     file format elf64-x86-64


Disassembly of section .text:

0000000000400080 <_start>:
  400080:       b9 80 b2 e6 0e          mov    ecx,0xee6b280
  400085:       66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00    data16 data16 data16 data16 data16 nop WORD PTR cs:[rax+rax*1+0x0]
  400094:       66 66 66 2e 0f 1f 84 00 00 00 00 00     data16 data16 nop WORD PTR cs:[rax+rax*1+0x0]

00000000004000a0 <_start.loop>:
  4000a0:       ff c9                   dec    ecx
  4000a2:       c5 f4 57 c9             vxorps ymm1,ymm1,ymm1
  4000a6:       c5 f4 57 c9             vxorps ymm1,ymm1,ymm1
  4000aa:       c5 f4 57 c9             vxorps ymm1,ymm1,ymm1
  4000ae:       c5 f4 57 c9             vxorps ymm1,ymm1,ymm1
  4000b2:       c5 f4 57 c9             vxorps ymm1,ymm1,ymm1
  4000b6:       c5 f4 57 c9             vxorps ymm1,ymm1,ymm1
  4000ba:       75 e4                   jne    4000a0 <_start.loop>
  4000bc:       31 ff                   xor    edi,edi
  4000be:       b8 e7 00 00 00          mov    eax,0xe7
  4000c3:       0f 05                   syscall

(ocperf.py is a wrapper with symbolic names for CPU-specific events.  It prints the perf command it actually ran):

perf stat -etask-clock,cycles,instructions,branches,cpu/event=0xe,umask=0x1,name=uops_issued_any/,cpu/event=0xc2,umask=0x2,name=uops_retired_retire_slots/,cpu/event=0xb1,umask=0x1,name=uops_executed_thread/ -r4 ./vxor-zero

 Performance counter stats for './vxor-zero' (4 runs):

        128.379226      task-clock:u (msec)       #    0.999 CPUs utilized            ( +-  0.07% )
       500,072,741      cycles:u                  #    3.895 GHz                      ( +-  0.01% )
     2,000,000,046      instructions:u            #    4.00  insn per cycle           ( +-  0.00% )
       250,000,040      branches:u                # 1947.356 M/sec                    ( +-  0.00% )
     2,000,012,004      uops_issued_any:u         # 15578.938 M/sec                   ( +-  0.00% )
     2,000,008,576      uops_retired_retire_slots:u # 15578.911 M/sec                   ( +-  0.00% )
       500,009,692      uops_executed_thread:u    # 3894.787 M/sec                    ( +-  0.00% )

       0.128516502 seconds time elapsed                                          ( +-  0.09% )

The +- percentages are there because I ran perf stat -r4, so it ran my binary 4 times.

uops_issued_any and uops_retired_retire_slots are fused-domain (front-end throughput limit of 4 per clock on Skylake and Bulldozer-family). The counts are nearly identical because there are no branch mispredicts (which lead to speculatively-issued uops being discarded instead of retired).

uops_executed_thread is unfused-domain uops (execution ports). xor-zeroing doesn't need any on Sandybridge-family Intel CPUs, so only the dec and branch uops actually execute. (If we changed the operands to vxorps so it wasn't just zeroing a register, e.g. vxorps ymm2, ymm1, ymm0 to write the output to a register that the next instruction doesn't read, the uops-executed count would match the fused-domain uop count, and we'd see that the throughput limit is three vxorps per clock.)

2000M fused-domain uops issued in 500M clock cycles is 4.0 uops issued per clock: the theoretical max front-end throughput. 6 vxorps per iteration * 250M iterations is 1500M; add the 250M dec and 250M jnz uops and you get the 2000M total, so these counts match Skylake decoding vxorps ymm,ymm,ymm to 1 fused-domain uop.

With a different number of uops in the loop, things aren't as good: e.g. a 5-uop loop only issues at 3.75 uops per clock. I intentionally chose this loop to be 8 uops (assuming vxorps decodes to a single uop).

The issue-width of Zen is 6 uops per cycle, so it may do better with a different amount of unrolling. (See this Q&A for more about short loops whose uop count isn't a multiple of the issue width, on Intel SnB-family uarches).


Solution 1:

xor'ing a ymm register with itself generates two micro-ops on AMD Ryzen, while xor'ing an xmm register with itself generates only one micro-op. So the optimal way of zeroing a ymm register is to xor the corresponding xmm register with itself and rely on implicit zero extension.

The only processor that supports AVX512 today is Knights Landing. It uses a single micro-op for xor'ing a zmm register. It is very common to handle a new, wider vector size by splitting it in two. This happened with the transition from 64 to 128 bits and with the transition from 128 to 256 bits. It is more than likely that some processors in the future (from AMD or Intel or any other vendor) will split 512-bit vectors into two 256-bit vectors or even four 128-bit vectors. So the optimal way to zero a zmm register is to xor the 128-bit register with itself and rely on zero extension. And you are right, the 128-bit VEX-coded instruction is one or two bytes shorter.
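
To make that concrete, a minimal C sketch (mine, not the answerer's); which width of the zeroing idiom the compiler actually emits depends on the compiler version, per the bug reports above.

#include <immintrin.h>

// Requires AVX512F. Any of these would zero the full zmm0:
//   vxorps xmm0, xmm0, xmm0   ; 4-byte VEX encoding, zero-extends to 512 bits
//   vxorps ymm0, ymm0, ymm0   ; 4-byte VEX encoding, zero-extends to 512 bits
//   vxorps zmm0, zmm0, zmm0   ; 6-byte EVEX encoding
// The xmm form is the one this answer recommends.
__m512 zero512(void) {
    return _mm512_setzero_ps();
}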

Most processors recognize the xor of a register with itself to be independent of the previous value of the register.