Why is indirect increment faster than direct increment?
Solution 1:
Ok, this is how you approach these things.
- Try to reproduce it. Okay, it reproduces:

        Benchmark            Mode  Cnt    Score   Error   Units
        LoopInc.directInc    thrpt   15  175.678 ± 1.118  ops/ms
        LoopInc.indirectInc  thrpt   15  641.413 ± 9.722  ops/ms
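The OP's source is not quoted in this answer, but judging from the method names, the 1000-iteration loop and the `ThreadLocalRandom` calls visible in the disassembly further down, the two variants presumably look roughly like this. Treat it as a reconstruction for orientation only, not the OP's exact code:

```java
import java.util.concurrent.ThreadLocalRandom;

import org.openjdk.jmh.annotations.Benchmark;

public class LoopInc {

    private int getValue() {
        // presumed source of the unpredictable 0/1 values
        return ThreadLocalRandom.current().nextInt(2);
    }

    @Benchmark
    public int directInc() {
        int result = 0;
        for (int i = 0; i < 1000; i++) {
            switch (getValue()) {      // the increment happens inside the switch
                case 0:
                    break;
                case 1:
                    result++;
                    break;
            }
        }
        return result;
    }

    @Benchmark
    public int indirectInc() {
        int result = 0;
        for (int i = 0; i < 1000; i++) {
            boolean inc = false;
            switch (getValue()) {      // the switch only decides...
                case 0:
                    break;
                case 1:
                    inc = true;
                    break;
            }
            if (inc) {                 // ...and the increment happens afterwards
                result++;
            }
        }
        return result;
    }
}
```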
- Try to see the generated assembly with `-prof perfasm`. Long story short, it produces a lot of generated code, so we probably want to limit loop unrolling to keep it readable. But unrolling can itself affect performance, and could very well be the cause, so let's re-run with `-XX:LoopUnrollLimit=1`. Okay, the score is lower, but the difference is still there, excellent:

        Benchmark            Mode  Cnt    Score   Error   Units
        LoopInc.directInc    thrpt   15  161.147 ± 6.101  ops/ms
        LoopInc.indirectInc  thrpt   15  489.430 ± 1.698  ops/ms
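In case you want to replay this step: assuming a standard JMH setup, one way to pin the flag for the forked benchmark JVMs is the `@Fork` annotation; passing `-jvmArgsAppend "-XX:LoopUnrollLimit=1"` together with `-prof perfasm` on the JMH command line should work as well. This is just a sketch of the harness plumbing, not part of the original answer:

```java
import org.openjdk.jmh.annotations.Fork;

// Applies to every @Benchmark in the class; the forked benchmark JVMs get the
// flag appended to their argument list, so it affects only the measured code.
@Fork(jvmArgsAppend = {"-XX:LoopUnrollLimit=1"})
public class LoopInc {
    // ... benchmarks as in the sketch above ...
}
```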
- Take another look at the generated code; still nothing that pops out to the eye. Well, this seems interesting. Let's get at this properly. Can we characterize the workload? Of course we can, with the help of `-prof perfnorm`, which normalizes the hardware counters per benchmark op. Let's see:

        Benchmark                                       Mode  Cnt      Score      Error  Units
        LoopInc.directInc                              thrpt   15    161.875 ±    3.038  ops/ms
        LoopInc.directInc:·CPI                         thrpt    3      0.967 ±    0.196   #/op
        LoopInc.directInc:·L1-dcache-load-misses       thrpt    3      0.394 ±    3.663   #/op
        LoopInc.directInc:·L1-dcache-loads             thrpt    3   2149.594 ±  228.166   #/op
        LoopInc.directInc:·L1-dcache-store-misses      thrpt    3      0.114 ±    1.001   #/op
        LoopInc.directInc:·L1-dcache-stores            thrpt    3   1073.666 ±   96.066   #/op
        LoopInc.directInc:·L1-icache-load-misses       thrpt    3      0.965 ±   22.984   #/op
        LoopInc.directInc:·LLC-loads                   thrpt    3      0.204 ±    2.763   #/op
        LoopInc.directInc:·LLC-stores                  thrpt    3      0.060 ±    0.633   #/op
        LoopInc.directInc:·branch-misses               thrpt    3    536.068 ±   43.293   #/op
        LoopInc.directInc:·branches                    thrpt    3   3728.890 ±  220.539   #/op
        LoopInc.directInc:·cycles                      thrpt    3  26219.146 ± 6287.590   #/op
        LoopInc.directInc:·dTLB-load-misses            thrpt    3      0.063 ±    0.124   #/op
        LoopInc.directInc:·dTLB-loads                  thrpt    3   2136.942 ±  165.990   #/op
        LoopInc.directInc:·dTLB-store-misses           thrpt    3      0.022 ±    0.029   #/op
        LoopInc.directInc:·dTLB-stores                 thrpt    3   1084.787 ±  417.281   #/op
        LoopInc.directInc:·iTLB-load-misses            thrpt    3      0.081 ±    0.333   #/op
        LoopInc.directInc:·iTLB-loads                  thrpt    3      3.623 ±   19.955   #/op
        LoopInc.directInc:·instructions                thrpt    3  27114.052 ± 1843.720   #/op
        LoopInc.indirectInc                            thrpt   15    489.164 ±    2.692  ops/ms
        LoopInc.indirectInc:·CPI                       thrpt    3      0.281 ±    0.015   #/op
        LoopInc.indirectInc:·L1-dcache-load-misses     thrpt    3      0.503 ±    9.071   #/op
        LoopInc.indirectInc:·L1-dcache-loads           thrpt    3   2149.806 ±  369.040   #/op
        LoopInc.indirectInc:·L1-dcache-store-misses    thrpt    3      0.167 ±    1.370   #/op
        LoopInc.indirectInc:·L1-dcache-stores          thrpt    3   1073.895 ±  186.741   #/op
        LoopInc.indirectInc:·L1-icache-load-misses     thrpt    3      0.313 ±    1.275   #/op
        LoopInc.indirectInc:·branch-misses             thrpt    3      1.102 ±    0.375   #/op
        LoopInc.indirectInc:·branches                  thrpt    3   2143.670 ±  228.475   #/op
        LoopInc.indirectInc:·cycles                    thrpt    3   8701.665 ±  706.183   #/op
        LoopInc.indirectInc:·dTLB-load-misses          thrpt    3      0.020 ±    0.301   #/op
        LoopInc.indirectInc:·dTLB-loads                thrpt    3   2141.965 ±  135.852   #/op
        LoopInc.indirectInc:·dTLB-store-misses         thrpt    3      0.002 ±    0.029   #/op
        LoopInc.indirectInc:·dTLB-stores               thrpt    3   1070.376 ±   81.445   #/op
        LoopInc.indirectInc:·iTLB-load-misses          thrpt    3      0.007 ±    0.135   #/op
        LoopInc.indirectInc:·iTLB-loads                thrpt    3      0.310 ±    5.768   #/op
        LoopInc.indirectInc:·instructions              thrpt    3  30968.207 ± 3627.540   #/op
Oh, both benchmarks execute a comparable number of instructions. The slower one just takes far more cycles; that is why CPI is also not ideal in `directInc`, while `indirectInc` produces a close-to-ideal CPI. If you look closely at the possible causes: there are not many cache misses, not many TLB misses, but the slow benchmark has lots of branch misses. AHA! Now we know what to look for in the generated code.
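To put numbers on that (everything below is plain arithmetic on the perfnorm table above, not a new measurement):

        CPI(directInc)    ≈ 26219 cycles / 27114 instructions ≈ 0.97
        CPI(indirectInc)  ≈  8702 cycles / 30968 instructions ≈ 0.28

        branch misses: ≈ 536 per op for directInc vs ≈ 1 per op for indirectInc;
        with 1000 loop iterations per op, directInc mispredicts roughly every
        other iteration, which is what you would expect from a 50/50 random branch.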
- Let's look at the generated code again. `-prof perfasm` conveniently highlights the jumps. And then you will see this...

directInc:

                           ╭│     0x00007fa0a82a50ff: jmp    0x00007fa0a82a5116
         11.39%  16.90%    ││ ↗   0x00007fa0a82a5101: inc    %edx                ;*iinc
                           ││ │                                                  ; - org.openjdk.LoopInc::directInc@46 (line 18)
         12.52%  23.11%    ││ │↗↗ 0x00007fa0a82a5103: mov    %r10,0xe8(%r11)     ;*invokevirtual putLong
                           ││ │││                                                ; - java.util.concurrent.ThreadLocalRandom::nextSeed@27 (line 241)
         12.00%   8.14%    ││ │││ 0x00007fa0a82a510a: inc    %r8d                ;*iinc
                           ││ │││                                                ; - org.openjdk.LoopInc::directInc@46 (line 18)
          0.03%   0.03%    ││ │││ 0x00007fa0a82a510d: cmp    $0x3e8,%r8d
                           │╰ │││ 0x00007fa0a82a5114: jge    0x00007fa0a82a50c7  ;*aload_0
                           │  │││                                                ; - org.openjdk.LoopInc::directInc@11 (line 19)
          0.80%   0.91%    ↘  │││ 0x00007fa0a82a5116: mov    0xf0(%r11),%r10d    ;*invokevirtual getInt
                              │││                                                ; - java.util.concurrent.ThreadLocalRandom::current@9 (line 222)
          4.28%   1.23%       │││ 0x00007fa0a82a511d: test   %r10d,%r10d
                             ╭│││ 0x00007fa0a82a5120: je     0x00007fa0a82a517b  ;*ifne
                             ││││                                                ; - java.util.concurrent.ThreadLocalRandom::current@12 (line 222)
          2.11%   0.01%      ││││ 0x00007fa0a82a5122: movabs $0x9e3779b97f4a7c15,%r10
          0.01%   0.07%      ││││ 0x00007fa0a82a512c: add    0xe8(%r11),%r10     ;*ladd
                             ││││                                                ; - java.util.concurrent.ThreadLocalRandom::nextSeed@24 (line 242)
          7.73%   1.89%      ││││ 0x00007fa0a82a5133: mov    %r10,%r9
          1.21%   1.84%      ││││ 0x00007fa0a82a5136: shr    $0x21,%r9
          1.90%   0.03%      ││││ 0x00007fa0a82a513a: xor    %r10,%r9
          2.02%   0.03%      ││││ 0x00007fa0a82a513d: movabs $0xff51afd7ed558ccd,%rcx
          0.94%   1.82%      ││││ 0x00007fa0a82a5147: imul   %rcx,%r9            ;*lmul
                             ││││                                                ; - java.util.concurrent.ThreadLocalRandom::mix32@9 (line 182)
          7.01%   2.40%      ││││ 0x00007fa0a82a514b: mov    %r9,%rcx
                             ││││ 0x00007fa0a82a514e: shr    $0x21,%rcx
          1.89%   0.70%      ││││ 0x00007fa0a82a5152: xor    %r9,%rcx
          3.11%   2.55%      ││││ 0x00007fa0a82a5155: movabs $0xc4ceb9fe1a85ec53,%r9
          0.99%   1.50%      ││││ 0x00007fa0a82a515f: imul   %r9,%rcx
          7.66%   2.89%      ││││ 0x00007fa0a82a5163: shr    $0x20,%rcx
          3.70%   1.97%      ││││ 0x00007fa0a82a5167: mov    %ecx,%r9d
          0.11%              ││││ 0x00007fa0a82a516a: and    $0x1,%r9d           ;*iand
                             ││││                                                ; - java.util.concurrent.ThreadLocalRandom::nextInt@34 (line 356)
          3.76%  11.13%      ││││ 0x00007fa0a82a516e: cmp    $0x1,%r9d
                             │╰││ 0x00007fa0a82a5172: je     0x00007fa0a82a5101
         10.48%  16.62%      │ ││ 0x00007fa0a82a5174: test   %r9d,%r9d
                             │ ╰│ 0x00007fa0a82a5177: je     0x00007fa0a82a5103  ;*lookupswitch
                             │  │                                                ; - org.openjdk.LoopInc::directInc@15 (line 19)
                             │  ╰ 0x00007fa0a82a5179: jmp    0x00007fa0a82a5103  ;*aload_0
                             │                                                   ; - org.openjdk.LoopInc::directInc@11 (line 19)
                             ↘    0x00007fa0a82a517b: mov    $0xffffff5d,%esi
indirectInc:

          0.01%   0.01%  ↗  0x00007f65588d8260: mov    %edx,%r9d
          0.01%          │  0x00007f65588d8263: nopw   0x0(%rax,%rax,1)
         11.99%  11.38%  │  0x00007f65588d826c: data16 data16 xchg %ax,%ax       ;*iconst_0
                         │                                                       ; - org.openjdk.LoopInc::indirectInc@11 (line 34)
                         │  0x00007f65588d8270: mov    0xf0(%r8),%r10d           ;*invokevirtual getInt
                         │                                                       ; - java.util.concurrent.ThreadLocalRandom::current@9 (line 222)
                         │  0x00007f65588d8277: test   %r10d,%r10d
                         │  0x00007f65588d827a: je     0x00007f65588d8331        ;*ifne
                         │                                                       ; - java.util.concurrent.ThreadLocalRandom::current@12 (line 222)
          0.01%          │  0x00007f65588d8280: movabs $0x9e3779b97f4a7c15,%r10
         11.80%  11.49%  │  0x00007f65588d828a: add    0xe8(%r8),%r10            ;*ladd
                         │                                                       ; - java.util.concurrent.ThreadLocalRandom::nextSeed@24 (line 242)
          0.01%   0.01%  │  0x00007f65588d8291: mov    %r10,0xe8(%r8)            ;*invokevirtual putLong
                         │                                                       ; - java.util.concurrent.ThreadLocalRandom::nextSeed@27 (line 241)
                         │  0x00007f65588d8298: mov    %r9d,%edx
          0.01%   0.01%  │  0x00007f65588d829b: inc    %edx
         11.12%  12.40%  │  0x00007f65588d829d: mov    %r10,%rcx
          0.01%          │  0x00007f65588d82a0: shr    $0x21,%rcx
          0.03%          │  0x00007f65588d82a4: xor    %r10,%rcx
          0.06%   0.03%  │  0x00007f65588d82a7: movabs $0xff51afd7ed558ccd,%r10
         12.38%  13.94%  │  0x00007f65588d82b1: imul   %r10,%rcx                 ;*lmul
                         │                                                       ; - java.util.concurrent.ThreadLocalRandom::mix32@9 (line 182)
          0.03%   0.01%  │  0x00007f65588d82b5: mov    %rcx,%r10
                         │  0x00007f65588d82b8: shr    $0x21,%r10
          0.03%          │  0x00007f65588d82bc: xor    %rcx,%r10
         11.43%  12.62%  │  0x00007f65588d82bf: movabs $0xc4ceb9fe1a85ec53,%rcx
          0.01%          │  0x00007f65588d82c9: imul   %rcx,%r10
          0.34%   0.30%  │  0x00007f65588d82cd: shr    $0x20,%r10
          0.85%   0.76%  │  0x00007f65588d82d1: mov    %r10d,%r10d
         11.81%  11.51%  │  0x00007f65588d82d4: and    $0x1,%r10d
          2.16%   1.78%  │  0x00007f65588d82d8: cmp    $0x1,%r10d
          3.45%   3.00%  │  0x00007f65588d82dc: cmovne %r9d,%edx                 <----- HERE IT IS
         17.55%  15.86%  │  0x00007f65588d82e0: inc    %r11d                     ;*iinc
                         │                                                       ; - org.openjdk.LoopInc::indirectInc@56 (line 33)
                         │  0x00007f65588d82e3: cmp    $0x3e8,%r11d
                         ╰  0x00007f65588d82ea: jl     0x00007f65588d8260        ;*if_icmpge
                                                                                 ; - org.openjdk.LoopInc::indirectInc@8 (line 33)
Notice the `cmovne` instead of a jump: this is why the branches are more "predictable" here; the data-dependent choice is made with a conditional move, so there is nothing left to mispredict. HotSpot profiles the branches, and emits the conditional move when the branch profile is very flat. In other words, it dodges a very likely branch misprediction by paying a bit of added latency for the conditional move. However, in this case the switch is special: it has more than two alternatives (0, 1, and "nothing"). This is why, I speculate, the `result` increment is not being folded into a cmov. (Generally speaking, HotSpot could have stored zero to `result` in "default", but it blew it, oh well.)
- To confirm that hypothesis, let's make a `directCompleteInc` case, where we still use `switch`, but now cover all the cases:

        @Benchmark
        public int directCompleteInc() {
            int result = 0;
            for (int i = 0; i < 1000; i++) {
                switch (getValue()) {
                    case 1:
                        result++;
                        break;
                    default:
                        break;
                }
            }
            return result;
        }
...and measure it, and this time without any options, like the OP did:

        Benchmark                  Mode  Cnt    Score   Error   Units
        LoopInc.directCompleteInc  thrpt    5  644.414 ± 0.371  ops/ms
        LoopInc.directInc          thrpt    5  174.974 ± 0.103  ops/ms
        LoopInc.indirectInc        thrpt    5  644.015 ± 0.533  ops/ms
THERE.
Confirm `directCompleteInc` is using `cmov` with `-prof perfasm`. It does.

Drink up.