Why does breaking the "output dependency" of LZCNT matter?

While benchmarking something I measured a much lower throughput than I had calculated, which I narrowed down to the LZCNT instruction (it also happens with TZCNT), as demonstrated in the following benchmarks:

  xor ecx, ecx
_benchloop:
  lzcnt eax, edx
  add ecx, 1
  jnz _benchloop

And:

  xor ecx, ecx
_benchloop:
  xor eax, eax  ; this shouldn't help, but it does
  lzcnt eax, edx
  add ecx, 1
  jnz _benchloop

The second version is much faster. It shouldn't be. There's no reason why LZCNT should have an input dependency on its output. Unlike BSR/BSF, the xZCNT instructions always overwrite their output.

I'm running this on a 4770K, so LZCNT and TZCNT aren't being executed as BSR/BSF.

What's going on here?


This is simply a limitation in the micro-architecture of your Intel Haswell CPU and several previous1 CPUs. It has been fixed for tzcnt and lzcnt as of Skylake-S (client), but the issue remained for popcnt until it was fixed in Cannon Lake.

On those micro-architectures the destination operand for tzcnt, lzcnt and popcnt is treated as an input dependency even though, semantically, it is not. Now I doubt this is a really a "bug": if it was simply an unintended behavior/oversight, I expect it would have been fixed in one of the several new micro-architectures that have been released since it was introduced.

Most likely it is a design compromise based on one or both of the following two factors:

  • The hardware for popcnt, lzcnt and tzcnt is likely all shared with the existing bsf and bsr instructions. Now bsf and bsr do have a dependency on the previous destination value in practice2 for the special case of all-bits-zero input, since Intel chips leave the destination unmodified in that case. So it is entirely possible that the simplest design for the combined hardware resulted in the other similar instructions executing on the same unit inheriting the same dependency.

  • The vast majority of x86 two operand ALU instructions have an dependency on the destination operand, since it is used as a source as well. The three affected instructions are somewhat unique in that they are unary operators, but unlike existing unary operators like not and neg which have a single operand used as source and destination, they have distinct source and destination operands, making them superficially similar to most 2-input instructions. Perhaps the renamer/scheduler circuitry just doesn't distinguish the special case of these unary-with-two-register-operand versus the vast majority of plain shared source/destination 2-input instructions which don't have this dependency.

In fact, for the case of popcnt Intel has issued various errata covering the false dependency issue such as HSD146 for Haswell Desktop and SKL029 for Skylake, which reads:

POPCNT Instruction May Take Longer to Execute Than Expected

Problem POPCNT instruction execution with a 32 or 64 bit operand may be delayed until previous non -dependent instructions have executed.

Implication Software using the POPCNT instruction may experience lower performance than expected.

Workaround None identified

I always found this erratum unusual since it isn't really identifying any type of functional defect or non-conformance to specification which is the case for essentially all the other errata. Intel doesn't really document a specific performance model for the OoO execution engine and there are a ton of other performance "gotchas" that have appeared and disappeared over the years (many with a much larger impact than this very minor issue) that don't get documented in errata. Still, this perhaps provides some evidence that it can be considered a bug. Oddly, the erratum was never extended to include tzcnt or lzcnt which had the same issue when they were introduced.


1 Well tzcnt and lzcnt only appeared in Haswell, but the problem exists for popcnt as well which was introduced in Nehalem - but the false dependency problem perhaps only exists for Sandy Bridge or later.

2In practice, although not documented in the ISA docs, since the result for all-zero input was undefined in the Intel manuals. Most or all Intel chips implemented the behavior as leaving the destination register unchanged in this case, however.
AMD does document and guarantee that behaviour for bsf and bsr.

(But unfortunately those instructions are slower than tzcnt/lzcnt on AMD (extra uops, see https://uops.info/), so instead of taking advantage of that bsf behaviour, it would often be better for AMD CPUs to use rep bsf so it will decode as tzcnt on CPUs that know about that instruction, and test/cmov if you have enough free registers. But bsr gives different results to lzcnt even for non-zero input, so you might consider taking advantage of it.)


Along the lines of what @BrettHale suggested, it's possible (if odd) that you're hitting a corner-case partial flags update stall. The flag state should in theory simply be renamed away because the following add updates all flags, but if it isn't for some reason then it would introduce a loop-carried dependency, and inserting xor would break that dependency.

It's hard to know for certain if this is what's happening, but it looks at a casual glance to be the most likely explanation; you can test the hypothesis by replacing the xor with test (which also breaks the flags dependency but has no effect on register dependencies).