Test whether a register is zero with CMP reg,0 vs OR reg,reg?
Yes, there is a difference in performance.
The best choice for comparing a register with zero is `test reg, reg`. It sets FLAGS the same way `cmp reg,0` would, is at least as fast¹ as any other way, and has smaller code-size.

(Even better is when ZF is already set appropriately by the instruction that set `reg`, so you can just branch, `setcc`, or `cmovcc` directly. For example, the bottom of a normal loop often looks like `dec ecx` / `jnz .loop_top`. Most x86 integer instructions "set flags according to the result", including ZF=1 if the output was 0.)
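For example, a minimal sketch of that loop pattern (hypothetical loop body):

```nasm
; counted loop: dec already sets ZF, so no test/cmp is needed before the branch
.loop_top:
    ; ... loop body using other registers ...
    dec   ecx               ; ZF=1 when ECX hits zero
    jnz   .loop_top         ; dec/jnz can even macro-fuse on Sandybridge-family
```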
`or reg,reg` can't macro-fuse with a JCC into a single uop on any existing x86 CPUs, and it adds latency for anything that later reads `reg` because it rewrites the value into the register. `cmp`'s downside is usually just code-size.
Footnote 1: There is a possible exception, but only on obsolete P6-family CPUs (Intel up to Nehalem, replaced by Sandybridge-family in 2011). See below about avoiding register-read stalls by rewriting the same value into a register. Other microarchitecture families don't have such stalls, and there's never any upside to `or` over `test`.
The FLAGS results of `test reg,reg` / `and reg,reg` / `or reg,reg` are identical to `cmp reg, 0` in all cases (except for AF) because:

- CF = OF = 0, because `test`/`and` always do that, and for `cmp` because subtracting zero can't overflow or carry.
- ZF, SF, and PF are set according to the result (i.e. `reg`): `reg & reg` for test, or `reg - 0` for cmp.

(AF is undefined after `test`, but set according to the result for `cmp`. I'm ignoring it because it's really obscure: the only instructions that read AF are the ASCII-adjust / packed-BCD instructions like `AAS`, and `lahf` / `pushf`.)
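As a quick illustration (a hypothetical snippet), each of these sequences leaves the same value in BL:

```nasm
    cmp   eax, 0
    sete  bl                ; BL = 1 if EAX == 0, else 0

    test  eax, eax
    sete  bl                ; identical result: ZF set the same way, smaller encoding

    or    eax, eax          ; same FLAGS again, but rewrites EAX (extra latency for later readers)
    sete  bl
```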
You can of course check conditions other than `reg == 0` (ZF), e.g. test for negative signed integers by looking at SF. But fun fact: `jl`, the signed less-than condition, is more efficient than `js` on some CPUs after a `cmp`. They're equivalent after a compare against zero because OF=0, so the `l` condition (SF!=OF) is equivalent to SF.
Every CPU that can macro-fuse TEST/JL can also macro-fuse TEST/JS, even Core 2. But after `cmp byte [mem], 0`, always use JL rather than JS to branch on the sign bit, because Core 2 can't macro-fuse that. (At least in 32-bit mode; Core 2 can't macro-fuse at all in 64-bit mode.)
The signed-compare conditions also let you do stuff like `jle` or `jg`, looking at ZF as well as SF!=OF.
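A short sketch of that sign-test equivalence (hypothetical labels):

```nasm
    test  eax, eax
    js    .negative          ; branch if SF=1 (sign bit set)

    test  eax, eax
    jl    .negative          ; l condition is SF != OF; OF=0 after test, so same as js
```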
`test` is shorter to encode than `cmp` with an immediate 0 in all cases except the `cmp al, imm8` special case, which is still two bytes.

Even then, `test` is preferable for macro-fusion reasons (with `jle` and similar on Core 2), and because having no immediate at all can possibly help uop-cache density by leaving a slot that another instruction can borrow if it needs more space (SnB-family).
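For reference, a sketch of the standard encodings behind that code-size claim (byte counts in the comments):

```nasm
    test  eax, eax          ; 85 C0            — 2 bytes
    cmp   eax, 0            ; 83 F8 00         — 3 bytes (sign-extended imm8)
    cmp   eax, 0            ; 3D 00 00 00 00   — 5 bytes (eax,imm32 short form, if an assembler chose it)
    cmp   al, 0             ; 3C 00            — 2 bytes (the AL,imm8 special case)
    test  al, al            ; 84 C0            — 2 bytes
```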
Macro-fusion of test/jcc into a single uop in the decoders
The decoders in Intel and AMD CPUs can internally macro-fuse `test` and `cmp` with some conditional branch instructions into a single compare-and-branch operation. This gives you a max throughput of 5 instructions per cycle when macro-fusion happens, vs. 4 without macro-fusion. (For Intel CPUs since Core 2.)

Recent Intel CPUs can macro-fuse some instructions (like `and` and `add`/`sub`) as well as `test` and `cmp`, but `or` is not one of them. AMD CPUs can only fuse `test` and `cmp` with a JCC. See x86_64 - Assembly - loop conditions and out of order, or refer directly to Agner Fog's microarch docs for the details of which CPU can macro-fuse what. `test` can macro-fuse in some cases where `cmp` can't, e.g. with `js`.
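A hypothetical byte-scanning loop to make the fusion concrete:

```nasm
; scan until a zero byte: test/jnz can macro-fuse into one uop on Intel (Core 2+) and AMD
.scan:
    mov   al, [rdi]
    inc   rdi
    test  al, al            ; sets ZF from AL, writes no register
    jnz   .scan             ; fuses with the test
; using "or al, al / jnz" here would not fuse, and would also rewrite AL
```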
Almost all simple ALU ops (bitwise booleans, add/sub, etc.) run in a single cycle. They all have the same "cost" in tracking them through the out-of-order execution pipeline. Intel and AMD spend the transistors to make execution units that add/sub/whatever in a single cycle. Yes, a bitwise `OR` or `AND` is simpler and probably uses slightly less power, but it still can't run any faster than one clock cycle.
`or reg, reg` adds another cycle of latency to the dependency chain for following instructions that need to read the register. It's an `x |= x` in the chain of operations that leads to the value you want.
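A minimal sketch of that extra latency (hypothetical surrounding code; `imul r32,r32` is about 3 cycles of latency on recent Intel):

```nasm
    imul  eax, ecx          ; produces EAX with ~3-cycle latency
    or    eax, eax          ; rewrites EAX: +1 cycle before anything can read it
    jz    .skip
    add   edx, eax          ; has to wait for the imul AND the or

    imul  eax, ecx
    test  eax, eax          ; reads EAX, writes only FLAGS
    jz    .skip
    add   edx, eax          ; waits only for the imul result
```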
You might think that the extra register write would also need an extra physical register file (PRF) entry vs. `test`, but that's probably not the case. (See https://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ for more about PRF capacity impact on out-of-order exec.)
`test` has to produce its FLAGS output somewhere. On Intel Sandybridge-family CPUs at least, when an instruction produces both a register and a FLAGS result, they're stored together in the same PRF entry. (Source: an Intel patent, I think. This is from memory but seems like an obviously sane design.)
An instruction like `cmp` or `test` that only produces a FLAGS result also needs a PRF entry for its output. Presumably this is slightly worse: the old physical register is still "alive", referenced as the holder of the value of the architectural register written by some older instruction. And now architectural EFLAGS (or more specifically, the separately-renamed CF and SPAZO flag groups) points to this new physical register in the RAT (register alias table) updated by the renamer. Of course, the next FLAGS-writing instruction will overwrite that, allowing the physical register to be freed once all its readers have read it and executed. This isn't something I think about when optimizing, and I don't think it tends to matter in practice.
P6-family register-read stalls: possible upside to `or reg,reg`
P6-family CPUs (PPro / PII to Nehalem) have a limited number of register-read ports for the issue/rename stage to read "cold" values (not forwarded from an in-flight instruction) from the permanent register file, but recently-written values are available directly from the ROB. Rewriting a register unnecessarily can make it live in the forwarding network again to help avoid register-read stalls. (See Agner Fog's microarch pdf).
Re-writing a register with the same value on purpose to keep it "hot" can actually be an optimization for some cases of surrounding code on P6. Early P6-family CPUs couldn't do macro-fusion at all, so you aren't even missing out on that by using `and reg,reg` instead of `test`. But Core 2 (in 32-bit mode) and Nehalem (in any mode) can macro-fuse test/jcc, so you're missing out on that.
(`and` is equivalent to `or` for this purpose on P6-family, but less bad if your code ever runs on a Sandybridge-family CPU: it can macro-fuse `and`/`jcc` but not `or`/`jcc`. The extra cycle of latency in the dep chain for the register is still a disadvantage on P6, especially if the critical path involving it is the main bottleneck.)
P6-family is very much obsolete these days (Sandybridge replaced it in 2011), and CPUs before Core 2 (Core, Pentium M, PIII, PII, PPro) are very obsolete and getting into retrocomputing territory, especially for anything where performance matters. You can ignore P6-family when optimizing unless you have a specific target machine in mind (e.g. a crusty old Nehalem Xeon) or you're tuning a compiler's `-mtune=nehalem` settings for the few users still left.
If you're tuning something to be fast on Core 2 / Nehalem, use `test` unless profiling shows that register-read stalls are a big problem in a specific case and using `and` actually fixes it.
On earlier P6-family, `and reg,reg` might be ok as your default code-gen choice when the value isn't part of a problematic loop-carried dep chain but is read later. Or if it is, but there's also a specific register-read stall you can fix with `and reg,reg`.
If you only want to test the low 8 bits of a full register, `test al,al` avoids writing a partial register, which on P6-family is renamed separately from the full EAX/RAX. `or al,al` is much worse if you later read EAX or AX: a partial-register stall on P6-family. (Why doesn't GCC use partial registers?)
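A sketch of that partial-register hazard on P6-family (hypothetical code):

```nasm
    or    al, al            ; writes AL, a partial register renamed separately on P6
    jz    .was_zero
    add   ecx, eax          ; reads full EAX after a partial write => partial-register stall (P6)

    test  al, al            ; writes only FLAGS, no partial-register write
    jz    .was_zero
    add   ecx, eax          ; no stall introduced by the flag check
```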
History of the unfortunate `or reg,reg` idiom
The `or reg,reg` idiom may have come from 8080 `ORA A`, as pointed out in a comment.
8080's instruction set doesn't have a `test` instruction, so your choices for setting flags according to a value included `ORA A` and `ANA A`. (Notice that the `A` register destination is baked into the mnemonic for both those instructions, and there aren't instructions to OR into different registers: it's a 1-address machine except for `mov`, while 8086 is a 2-address machine for most instructions.)
8080 `ORA A` was the usual go-to way to do it, so presumably that habit carried over into 8086 assembly programming as people ported their asm sources. (Or used automatic tools; 8086 was intentionally designed for easy / automatic asm-source porting from 8080 code.)
This bad idiom continues to be blindly used by beginners, presumably taught by people who learned it back in the day and passed it on without thinking about the obvious critical path latency downside for out-of-order execution. (Or the other more subtle problems like no macro-fusion.)
Delphi's compiler reportedly uses `or eax,eax`, which was maybe a reasonable choice at the time (before Core 2), assuming that register-read stalls were more important than lengthening the dep chain for whatever reads it next. IDK if that's true or if they were just using the ancient idiom without thinking about it.
Unfortunately, compiler writers at the time didn't know the future: `and eax,eax` performs exactly equivalently to `or eax,eax` on Intel P6-family, but is less bad on other uarches because `and` can macro-fuse on Sandybridge-family. (See the P6 section above.)
Value in memory: maybe use `cmp`, or load it into a reg
To test a value in memory, you can `cmp dword [mem], 0`, but Intel CPUs can't macro-fuse flag-setting instructions that have both an immediate and a memory operand. If you're going to use the value after the compare on one side of the branch, you should `mov eax, [mem]` / `test eax,eax` or something. If not, either way is 2 front-end uops, but it's a tradeoff between code-size and back-end uop count.
Although note that some addressing modes won't micro-fuse either on SnB-family: RIP-relative + immediate won't micro-fuse in the decoders, and an indexed addressing mode will un-laminate after the uop cache. Either way, that leads to 3 fused-domain uops for `cmp dword [rsi + rcx*4], 0` / `jne`, or with `[rel some_static_location]` as the address.
On i7-6700k Skylake (tested with perf events `uops_issued.any` and `uops_executed.thread`):
- `mov reg, [mem]` (or `movzx`) + `test reg,reg` / `jnz`: 2 uops in both the fused and unfused domains, regardless of addressing mode or of using `movzx` instead of `mov`. Nothing to micro-fuse; does macro-fuse.
- `cmp byte [rip+static_var], 0` + `jne`: 3 fused, 3 unfused (front and back ends). The RIP-relative + immediate combination prevents micro-fusion, and it doesn't macro-fuse either. Smaller code-size but less efficient.
- `cmp byte [rsi + rdi], 0` (indexed addressing mode) / `jne`: 3 fused, 3 unfused. Micro-fuses in the decoders, but un-laminates at issue/rename. Doesn't macro-fuse.
- `cmp byte [rdi + 16], 0` + `jne`: 2 fused, 3 unfused uops. Micro-fusion of the cmp load+ALU did happen because of the simple addressing mode, but the immediate prevents macro-fusion. About as good as load + test + jnz: smaller code-size but 1 extra back-end uop.
If you have a `0` in a register (or a `1` if you want to compare a bool), you can `cmp [mem], reg` / `jne` for even fewer uops, as low as 1 fused-domain, 2 unfused. But RIP-relative addressing modes still don't macro-fuse.
Compilers tend to use load + test/jcc even when the value isn't used later.
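A hedged sketch of the two options for a global variable (`some_var` is a made-up symbol):

```nasm
; option 1: load + test — 2 fused-domain uops, macro-fuses, value is in a register if needed
    mov   eax, [rel some_var]
    test  eax, eax
    jnz   .nonzero

; option 2: compare memory directly — smaller code, but RIP-relative + immediate
; can't micro-fuse and doesn't macro-fuse (3 fused-domain uops, per the numbers above)
    cmp   dword [rel some_var], 0
    jne   .nonzero
```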
You could also test a value in memory with `test dword [mem], -1`, but don't. Since `test r/m16/32/64, sign-extended-imm8` isn't available, it's worse code-size than `cmp` for anything larger than bytes. (I think the design idea was that if you only want to test the low bit of a register, you'd just `test cl, 1` instead of `test ecx, 1`, and use cases like `test ecx, 0xfffffff0` are rare enough that it wasn't worth spending an opcode on. Especially since that decision was made for 8086 with 16-bit code, where it was only the difference between an imm8 and an imm16, not an imm32.)
(I wrote -1 rather than 0xFFFFFFFF so it would be the same with `byte` or `qword` operand-size. `~0` would be another way to write it.)
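For illustration, the encodings behind that code-size argument (standard x86 encodings, shown as a sketch):

```nasm
    test  cl, 1             ; F6 C1 01             — 3 bytes
    test  ecx, 1            ; F7 C1 01 00 00 00    — 6 bytes (no sign-extended imm8 form exists for test)
    cmp   ecx, 1            ; 83 F9 01             — 3 bytes (cmp does have an imm8 form)
```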
Related:
- What is instruction fusion in contemporary x86 processors? (micro- and macro-fusion). TODO: move the test results there (and update my answer there to fix some things that don't match my current results.)
- x86_64 - Assembly - loop conditions and out of order (which instructions can macro-fusion on Sandybridge-family)
It depends on the exact code sequence, which specific CPU it is, and other factors.
The main problem with `or al, al` is that it "modifies" `EAX`, which means that a subsequent instruction that uses `EAX` in some way may stall until this instruction completes. Note that the conditional branch (`jz`) also depends on the instruction, but CPU manufacturers do a lot of work (branch prediction and speculative execution) to mitigate that. Also note that in theory it would be possible for a CPU manufacturer to design a CPU that recognises `EAX` isn't changed in this specific case, but there are hundreds of these special cases and the benefits of recognising most of them are too small.
The main problem with `cmp al,0` is that it's slightly larger, which might mean slower instruction fetch / more cache pressure, and (if it's in a loop) might mean that the code no longer fits in some CPUs' "loop buffer".
As Jester pointed out in the comments, `test al,al` avoids both problems: it's smaller than `cmp al,0` and doesn't modify `EAX`.
Of course (depending on the specific sequence) the value in `AL` must've come from somewhere, and if it came from an instruction that set flags appropriately, it might be possible to modify the code to avoid using another instruction to set flags again later.
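For instance (a hypothetical sketch), if `AL` was just produced by a flag-setting instruction, the extra flag-setting instruction can be dropped:

```nasm
    and   al, 0x0F          ; already sets ZF according to the result
    test  al, al            ; redundant
    jz    .low_nibble_zero

    and   al, 0x0F          ; same work; branch directly on the flags it set
    jz    .low_nibble_zero
```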