Strange branching performance
The results of my benchmark shows that the performance is worst when the branch has 15% (or 85%) percent probability rather than 50%.
Any explanation?
The code is too long but the relevant parts are here:
private int diff(char c) {
return TABLE[(145538857 * c) >>> 27] - c;
}
@Benchmark int timeBranching(int reps) {
int result = 0;
while (reps-->0) {
for (final char c : queries) {
if (diff(c) == 0) {
++result;
}
}
}
return result;
}
It counts the number of BREAKING_WHITESPACE characters in the given string. The results shows a sudden time drop (performance increase) when the branching probability reaches about 0.20.
More details about the drop. Varying the seed shows more strange things happening. Note that the black line denoting the minimum and maximum values is very short except when close to the cliff.
Solution 1:
It looks like a minor JIT bug. For a small branch probability, it generates something like the following, just much more complicated due to unrolling (I'm simplifying a lot):
movzwl 0x16(%r8,%r10,2),%r9d
Get the char: int r9d = queries[r10]
imul $0x8acbf29,%r9d,%ebx
Multiply: ebx = 145538857 * r9d
shr $0x1b,%ebx
Shift: ebx >>>= 27
cmp %edx,%ebx
jae somewhere
Check bounds: if (ebx > edx || ebx < 0) goto somewhere
(and throw there an IndexOutOfBoundsException
.
cmp %r9d,%ebx
jne back
Jump back to loop beginning if not equal: if (r9d != ebx) goto back
inc %eax
Increment the result: eax++
jne back
Simply goto back
We can see one smart and one dumb thing:
- The bound check gets done optimally with a single unsigned comparison.
- It's completely redundant as x >>> 27 is always positive and less than the table length (32).
For a branch probability above 20%, these three instructions
cmp %r9d,%ebx
jne back
inc %eax
get replaced by something like
mov %eax,%ecx // ecx = result
inc %ecx // ecx++
cmp %r9d,%ebx // if (r9b == index)
cmoveq %ecx,%ebx // result = ecx
using a conditional move instruction. This is one instruction more, but no branching and thus no branch misprediction penalty.
This explains the very constant time in the 20-80% range. The ramping below 20% is clearly due to branch mispredictions.
So it looks like a JIT failure to use the proper threshold of about 0.04 rather than 0.18.