Why is (a*b != 0) faster than (a != 0 && b != 0) in Java?
Solution 1:
I'm ignoring the issue that your benchmarking might be flawed, and taking the result at face value.
Is it the compiler or is it at the hardware level?
The latter, I think:
if (a != 0 && b != 0)
will compile to two memory loads and two conditional branches, while
if (a * b != 0)
will compile to two memory loads, a multiply and one conditional branch.
The multiply is likely to be faster than the second conditional branch if the hardware-level branch prediction is ineffective. As you increase the ratio ... the branch prediction becomes less effective.
The reason that conditional branches are slower is that they cause the instruction execution pipeline to stall. Branch prediction is about avoiding the stall by predicting which way the branch is going to go and speculatively choosing the next instruction based on that. If the prediction fails, there is a delay while the instruction for the other direction is loaded.
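The stall-versus-prediction effect described above can be observed from plain Java with a rough experiment: run the same branchy loop over random data and over a sorted copy of that data. The class and method names below are illustrative, and this is not a rigorous benchmark (no JMH, a crude warm-up), so treat the printed timings as suggestive only.

```java
import java.util.Arrays;
import java.util.Random;

public class BranchPredictionDemo {
    // Counts elements >= 128; the comparison compiles to a conditional branch.
    static long countLarge(int[] data) {
        long count = 0;
        for (int v : data) {
            if (v >= 128) count++;  // hard to predict on random data, easy on sorted data
        }
        return count;
    }

    public static void main(String[] args) {
        int[] random = new Random(42).ints(1_000_000, 0, 256).toArray();
        int[] sorted = random.clone();
        Arrays.sort(sorted);

        // Crude warm-up so the JIT compiles countLarge before we time it.
        for (int i = 0; i < 10; i++) { countLarge(random); countLarge(sorted); }

        long t0 = System.nanoTime();
        long a = countLarge(random);
        long t1 = System.nanoTime();
        long b = countLarge(sorted);
        long t2 = System.nanoTime();

        // Same count either way; the sorted pass is typically faster because
        // the branch predictor sees one long run of each outcome.
        System.out.println("random: " + (t1 - t0) + " ns, sorted: " + (t2 - t1) + " ns");
        System.out.println(a == b);
    }
}
```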
(Note: the above explanation is oversimplified. For a more accurate explanation, you need to look at the literature provided by the CPU manufacturer for assembly language coders and compiler writers. The Wikipedia page on Branch Predictors is good background.)
However, there is one thing that you need to be careful about with this optimization: are there any values where a * b != 0 will give the wrong answer? Consider cases where computing the product results in integer overflow.
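A minimal example of that failure mode, using 65536 (2^16) for both operands so the true product is exactly 2^32:

```java
public class MultiplyOverflow {
    public static void main(String[] args) {
        int a = 65536;  // 2^16, clearly non-zero
        int b = 65536;  // 2^16, clearly non-zero

        // The mathematical product is 2^32, whose low 32 bits are all zero,
        // so 32-bit int multiplication wraps around to 0.
        System.out.println(a * b);             // prints 0
        System.out.println(a != 0 && b != 0);  // prints true
        System.out.println(a * b != 0);        // prints false -- the wrong answer
    }
}
```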
UPDATE
Your graphs tend to confirm what I said.
There is also a "branch prediction" effect in the conditional branch of the a * b != 0 case, and this comes out in the graphs. If you project the curves beyond 0.9 on the X-axis, it looks like 1) they will meet at about 1.0, and 2) the meeting point will be at roughly the same Y value as for X = 0.0.
UPDATE 2
I don't understand why the curves are different for the a + b != 0 and the a | b != 0 cases. There could be something clever in the branch predictor's logic. Or it could indicate something else.
(Note that this kind of thing can be specific to a particular chip model number or even version. The results of your benchmarks could be different on other systems.)
However, they both have the advantage of working for all non-negative values of a and b.
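The non-negative caveat matters mostly for the a + b form. A small sketch of where the two expressions agree and where addition breaks down (note that a | b is zero only when both operands are zero, so the OR form also happens to survive negative inputs):

```java
public class EitherNonZero {
    public static void main(String[] args) {
        // For non-negative operands, both forms agree with (a != 0 || b != 0):
        int a = 3, b = 0;
        System.out.println((a | b) != 0);  // prints true
        System.out.println((a + b) != 0);  // prints true

        // With a negative operand, the sum can cancel to exactly zero:
        int c = 5, d = -5;
        System.out.println(c != 0 || d != 0);  // prints true
        System.out.println((c + d) != 0);      // prints false -- wrong
        System.out.println((c | d) != 0);      // prints true -- OR is still correct
    }
}
```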
Solution 2:
I think your benchmark has some flaws and might not be useful for drawing conclusions about real programs. Here are my thoughts:
- (a|b)!=0 and (a+b)!=0 test if either value is non-zero, whereas a != 0 && b != 0 and (a*b)!=0 test if both are non-zero. So you are not comparing the timing of just the arithmetic: if the condition is true more often, it causes more executions of the if body, which takes more time too.
- (a+b)!=0 will do the wrong thing for positive and negative values that sum to zero, so you can't use it in the general case, even if it works here.
- Similarly, (a*b)!=0 will do the wrong thing for values that overflow. (Random example: 196608 * 327680 is 0 because the true result happens to be divisible by 2^32, so its low 32 bits are 0, and those bits are all you get if it's an int operation.)
- The VM will optimize the expression during the first few runs of the outer (fraction) loop, when fraction is 0 and the branches are almost never taken. The optimizer may do different things if you start fraction at 0.5.
- Unless the VM is able to eliminate some of the array bounds checks here, there are four other branches in the expression just due to the bounds checks, and that's a complicating factor when trying to figure out what's happening at a low level. You might get different results if you split the two-dimensional array into two flat arrays, changing nums[0][i] and nums[1][i] to nums0[i] and nums1[i].
- CPU branch predictors detect short patterns in the data, or runs of all branches being taken or not taken. Your randomly generated benchmark data is the worst-case scenario for a branch predictor. If real-world data has a predictable pattern, or it has long runs of all-zero and all-non-zero values, the branches could cost much less.
- The particular code that is executed after the condition is met can affect the performance of evaluating the condition itself, because it affects things like whether or not the loop can be unrolled, which CPU registers are available, and whether any of the fetched nums values need to be reused after evaluating the condition. Merely incrementing a counter in the benchmark is not a perfect placeholder for what real code would do.
- System.currentTimeMillis() is on most systems not more accurate than +/- 10 ms. System.nanoTime() is usually more accurate.
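On that last point, a minimal sketch of how to time a short piece of work with System.nanoTime() instead of System.currentTimeMillis() (the loop body here is just a stand-in workload):

```java
public class TimingDemo {
    public static void main(String[] args) {
        // currentTimeMillis() often has ~10 ms granularity; nanoTime() is the
        // monotonic clock intended for measuring elapsed intervals.
        long start = System.nanoTime();

        long sum = 0;
        for (int i = 0; i < 1_000_000; i++) {
            sum += i;  // stand-in workload: sum of 0..999999
        }

        long elapsedNs = System.nanoTime() - start;
        System.out.println("sum=" + sum + " took " + elapsedNs + " ns");
    }
}
```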
There are lots of uncertainties, and it's always hard to say anything definite with these sorts of micro-optimizations, because a trick that is faster on one VM or CPU can be slower on another. If you are running the 32-bit HotSpot JVM rather than the 64-bit version, be aware that it comes in two flavors: the "Client" VM has different (weaker) optimizations than the "Server" VM.
If you can disassemble the machine code generated by the VM, do that rather than trying to guess what it does!
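For HotSpot, one way to do that is the PrintAssembly diagnostic flag. This assumes a JDK with the hsdis disassembler plugin available on the JVM's library path (it is not bundled with most builds), and Benchmark here is a stand-in for your own main class:

```shell
# Dumps JIT-compiled machine code as it is generated.
# Requires the hsdis plugin; without it the JVM prints a warning instead.
java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly Benchmark
```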