Why does breaking the "output dependency" of LZCNT matter?
While benchmarking something I measured a much lower throughput than I had calculated, which I narrowed down to the LZCNT instruction (it also happens with TZCNT), as demonstrated in the following benchmarks:
xor ecx, ecx
_benchloop:
lzcnt eax, edx
add ecx, 1
jnz _benchloop
And:
xor ecx, ecx
_benchloop:
xor eax, eax ; this shouldn't help, but it does
lzcnt eax, edx
add ecx, 1
jnz _benchloop
The second version is much faster. It shouldn't be. There's no reason why LZCNT should have an input dependency on its output. Unlike BSR/BSF, the xZCNT instructions always overwrite their output.
I'm running this on a 4770K, so LZCNT and TZCNT aren't being executed as BSR/BSF.
What's going on here?
This is simply a limitation in the micro-architecture of your Intel Haswell CPU and several previous1 CPUs. It has been fixed for tzcnt
and lzcnt
as of Skylake-S (client), but the issue remained for popcnt
until it was fixed in Cannon Lake.
On those micro-architectures the destination operand for tzcnt
, lzcnt
and popcnt
is treated as an input dependency even though, semantically, it is not. Now I doubt this is a really a "bug": if it was simply an unintended behavior/oversight, I expect it would have been fixed in one of the several new micro-architectures that have been released since it was introduced.
Most likely it is a design compromise based on one or both of the following two factors:
-
The hardware for
popcnt
,lzcnt
andtzcnt
is likely all shared with the existingbsf
andbsr
instructions. Nowbsf
andbsr
do have a dependency on the previous destination value in practice2 for the special case of all-bits-zero input, since Intel chips leave the destination unmodified in that case. So it is entirely possible that the simplest design for the combined hardware resulted in the other similar instructions executing on the same unit inheriting the same dependency. -
The vast majority of x86 two operand ALU instructions have an dependency on the destination operand, since it is used as a source as well. The three affected instructions are somewhat unique in that they are unary operators, but unlike existing unary operators like
not
andneg
which have a single operand used as source and destination, they have distinct source and destination operands, making them superficially similar to most 2-input instructions. Perhaps the renamer/scheduler circuitry just doesn't distinguish the special case of these unary-with-two-register-operand versus the vast majority of plain shared source/destination 2-input instructions which don't have this dependency.
In fact, for the case of popcnt
Intel has issued various errata covering the false dependency issue such as HSD146 for Haswell Desktop and SKL029 for Skylake, which reads:
POPCNT Instruction May Take Longer to Execute Than Expected
Problem POPCNT instruction execution with a 32 or 64 bit operand may be delayed until previous non -dependent instructions have executed.
Implication Software using the POPCNT instruction may experience lower performance than expected.
Workaround None identified
I always found this erratum unusual since it isn't really identifying any type of functional defect or non-conformance to specification which is the case for essentially all the other errata. Intel doesn't really document a specific performance model for the OoO execution engine and there are a ton of other performance "gotchas" that have appeared and disappeared over the years (many with a much larger impact than this very minor issue) that don't get documented in errata. Still, this perhaps provides some evidence that it can be considered a bug. Oddly, the erratum was never extended to include tzcnt
or lzcnt
which had the same issue when they were introduced.
1 Well tzcnt
and lzcnt
only appeared in Haswell, but the problem exists for popcnt
as well which was introduced in Nehalem - but the false dependency problem perhaps only exists for Sandy Bridge or later.
2In practice, although not documented in the ISA docs, since the result for all-zero input was undefined in the Intel manuals. Most or all Intel chips implemented the behavior as leaving the destination register unchanged in this case, however.
AMD does document and guarantee that behaviour for bsf
and bsr
.
(But unfortunately those instructions are slower than tzcnt
/lzcnt
on AMD (extra uops, see https://uops.info/), so instead of taking advantage of that bsf
behaviour, it would often be better for AMD CPUs to use rep bsf
so it will decode as tzcnt
on CPUs that know about that instruction, and test
/cmov
if you have enough free registers. But bsr
gives different results to lzcnt
even for non-zero input, so you might consider taking advantage of it.)
Along the lines of what @BrettHale suggested, it's possible (if odd) that you're hitting a corner-case partial flags update stall. The flag state should in theory simply be renamed away because the following add updates all flags, but if it isn't for some reason then it would introduce a loop-carried dependency, and inserting xor would break that dependency.
It's hard to know for certain if this is what's happening, but it looks at a casual glance to be the most likely explanation; you can test the hypothesis by replacing the xor
with test
(which also breaks the flags dependency but has no effect on register dependencies).