Why is indirect increment faster than direct increment?

Solution 1:

Ok, this is how you approach these things.

  1. Try to reproduce it. Okay, it reproduces (a sketch of the benchmarks under test follows the table):

    Benchmark             Mode  Cnt    Score   Error   Units
    LoopInc.directInc    thrpt   15  175.678 ± 1.118  ops/ms
    LoopInc.indirectInc  thrpt   15  641.413 ± 9.722  ops/ms
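
    For reference, the benchmarks under test are presumably shaped like the sketch below. This is my reconstruction from the bytecode references in the perfasm listings in step 4 (the lookupswitch in directInc, the flag test in indirectInc), not the OP's verbatim source, and getValue() is assumed to delegate to ThreadLocalRandom:

    private int getValue() {
        // Assumed: the perfasm listings below resolve to ThreadLocalRandom.current().nextInt(...)
        return ThreadLocalRandom.current().nextInt(2);
    }

    @Benchmark
    public int directInc() {
        int result = 0;
        for (int i = 0; i < 1000; i++) {
            switch (getValue()) {      // the lookupswitch in the step 4 listing
                case 0:
                    break;
                case 1:
                    result++;
                    break;
            }
        }
        return result;
    }

    @Benchmark
    public int indirectInc() {
        int result = 0;
        for (int i = 0; i < 1000; i++) {
            boolean inc = false;
            switch (getValue()) {
                case 0:
                    break;
                case 1:
                    inc = true;
                    break;
            }
            if (inc) {
                result++;              // this is what ends up as cmovne in step 4
            }
        }
        return result;
    }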
    
  2. Try to see the generated assembly with -prof perfasm. Long story short, it produces a lot of generated code because of loop unrolling, so we probably want to limit it. But heavy unrolling can itself affect performance, and could even be the cause of the difference, so let's re-run with -XX:LoopUnrollLimit=1 and check (a sketch for pinning this flag in code follows the table). Okay, the scores are lower, but the difference is still there, excellent:

    Benchmark             Mode  Cnt    Score   Error   Units
    LoopInc.directInc    thrpt   15  161.147 ± 6.101  ops/ms
    LoopInc.indirectInc  thrpt   15  489.430 ± 1.698  ops/ms
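
    As an aside, the unroll limit can be pinned on the benchmark itself instead of being passed on the command line every time. A minimal sketch; the class and method here are made up for illustration, but @Fork(jvmArgsAppend = ...) is the real JMH way to append flags to the forked JVMs:

    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.Fork;

    public class UnrollLimitExample {
        // jvmArgsAppend adds the flag to every forked benchmark JVM,
        // equivalent to passing -XX:LoopUnrollLimit=1 on the JMH command line.
        @Fork(value = 3, jvmArgsAppend = "-XX:LoopUnrollLimit=1")
        @Benchmark
        public int work() {
            int result = 0;
            for (int i = 0; i < 1000; i++) {
                result += i & 1;       // trivial body, just something to (not) unroll
            }
            return result;
        }
    }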
    
  3. Take another look at the generated code; still nothing pops out. Well, this is getting interesting, so let's get at it properly. Can we characterize the workload? Of course we can, with the help of -prof perfnorm, which normalizes the hardware counters per benchmark op (a sketch for scripting such profiler runs closes this step). Let's see:

    Benchmark                                     Mode  Cnt      Score      Error   Units
    LoopInc.directInc                            thrpt   15    161.875 ±    3.038  ops/ms
    LoopInc.directInc:·CPI                       thrpt    3      0.967 ±    0.196    #/op
    LoopInc.directInc:·L1-dcache-load-misses     thrpt    3      0.394 ±    3.663    #/op
    LoopInc.directInc:·L1-dcache-loads           thrpt    3   2149.594 ±  228.166    #/op
    LoopInc.directInc:·L1-dcache-store-misses    thrpt    3      0.114 ±    1.001    #/op
    LoopInc.directInc:·L1-dcache-stores          thrpt    3   1073.666 ±   96.066    #/op
    LoopInc.directInc:·L1-icache-load-misses     thrpt    3      0.965 ±   22.984    #/op
    LoopInc.directInc:·LLC-loads                 thrpt    3      0.204 ±    2.763    #/op
    LoopInc.directInc:·LLC-stores                thrpt    3      0.060 ±    0.633    #/op
    LoopInc.directInc:·branch-misses             thrpt    3    536.068 ±   43.293    #/op
    LoopInc.directInc:·branches                  thrpt    3   3728.890 ±  220.539    #/op
    LoopInc.directInc:·cycles                    thrpt    3  26219.146 ± 6287.590    #/op
    LoopInc.directInc:·dTLB-load-misses          thrpt    3      0.063 ±    0.124    #/op
    LoopInc.directInc:·dTLB-loads                thrpt    3   2136.942 ±  165.990    #/op
    LoopInc.directInc:·dTLB-store-misses         thrpt    3      0.022 ±    0.029    #/op
    LoopInc.directInc:·dTLB-stores               thrpt    3   1084.787 ±  417.281    #/op
    LoopInc.directInc:·iTLB-load-misses          thrpt    3      0.081 ±    0.333    #/op
    LoopInc.directInc:·iTLB-loads                thrpt    3      3.623 ±   19.955    #/op
    LoopInc.directInc:·instructions              thrpt    3  27114.052 ± 1843.720    #/op
    
    LoopInc.indirectInc                          thrpt   15    489.164 ±    2.692  ops/ms
    LoopInc.indirectInc:·CPI                     thrpt    3      0.281 ±    0.015    #/op
    LoopInc.indirectInc:·L1-dcache-load-misses   thrpt    3      0.503 ±    9.071    #/op
    LoopInc.indirectInc:·L1-dcache-loads         thrpt    3   2149.806 ±  369.040    #/op
    LoopInc.indirectInc:·L1-dcache-store-misses  thrpt    3      0.167 ±    1.370    #/op
    LoopInc.indirectInc:·L1-dcache-stores        thrpt    3   1073.895 ±  186.741    #/op
    LoopInc.indirectInc:·L1-icache-load-misses   thrpt    3      0.313 ±    1.275    #/op
    LoopInc.indirectInc:·branch-misses           thrpt    3      1.102 ±    0.375    #/op
    LoopInc.indirectInc:·branches                thrpt    3   2143.670 ±  228.475    #/op
    LoopInc.indirectInc:·cycles                  thrpt    3   8701.665 ±  706.183    #/op
    LoopInc.indirectInc:·dTLB-load-misses        thrpt    3      0.020 ±    0.301    #/op
    LoopInc.indirectInc:·dTLB-loads              thrpt    3   2141.965 ±  135.852    #/op
    LoopInc.indirectInc:·dTLB-store-misses       thrpt    3      0.002 ±    0.029    #/op
    LoopInc.indirectInc:·dTLB-stores             thrpt    3   1070.376 ±   81.445    #/op
    LoopInc.indirectInc:·iTLB-load-misses        thrpt    3      0.007 ±    0.135    #/op
    LoopInc.indirectInc:·iTLB-loads              thrpt    3      0.310 ±    5.768    #/op
    LoopInc.indirectInc:·instructions            thrpt    3  30968.207 ± 3627.540    #/op
    

    Oh, both benchmarks execute a comparable number of instructions, but the slower one takes roughly 3x more cycles. You can read the same thing off CPI = cycles/instructions: directInc runs at 26219/27114 ≈ 0.97 cycles per instruction, while indirectInc runs at a close-to-ideal 8701/30968 ≈ 0.28. Looking closely at the possible causes: there are hardly any cache misses and hardly any TLB misses, but the slow benchmark has lots of branch misses, around 536 per op versus around 1. At a typical misprediction penalty in the tens of cycles, those branch misses alone plausibly account for most of the extra ~17,500 cycles. AHA! Now we know what to look for in the generated code.
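
    For completeness, such profiler runs can also be scripted through the JMH Runner API rather than the -prof command-line switches. A minimal sketch; the profiler and builder classes are real JMH API, the wrapper class is assumed:

    import org.openjdk.jmh.profile.LinuxPerfNormProfiler;
    import org.openjdk.jmh.runner.Runner;
    import org.openjdk.jmh.runner.options.Options;
    import org.openjdk.jmh.runner.options.OptionsBuilder;

    public class RunPerfNorm {
        public static void main(String[] args) throws Exception {
            Options opts = new OptionsBuilder()
                    .include("LoopInc")                      // match the benchmark class
                    .jvmArgsAppend("-XX:LoopUnrollLimit=1")  // keep unrolling tamed, as above
                    .addProfiler(LinuxPerfNormProfiler.class)
                    .build();
            new Runner(opts).run();                          // prints the per-op counters
        }
    }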

  4. Let's look at the generated code again. -prof perfasm conveniently highlights the jumps. And then you will see this...

    directInc:

                      ╭│      0x00007fa0a82a50ff: jmp    0x00007fa0a82a5116
     11.39%   16.90%  ││ ↗    0x00007fa0a82a5101: inc    %edx               ;*iinc
                      ││ │                                                  ; - org.openjdk.LoopInc::directInc@46 (line 18)
     12.52%   23.11%  ││ │↗↗  0x00007fa0a82a5103: mov    %r10,0xe8(%r11)    ;*invokevirtual putLong
                      ││ │││                                                ; - java.util.concurrent.ThreadLocalRandom::nextSeed@27 (line 241)
     12.00%    8.14%  ││ │││  0x00007fa0a82a510a: inc    %r8d               ;*iinc
                      ││ │││                                                ; - org.openjdk.LoopInc::directInc@46 (line 18)
      0.03%    0.03%  ││ │││  0x00007fa0a82a510d: cmp    $0x3e8,%r8d
                      │╰ │││  0x00007fa0a82a5114: jge    0x00007fa0a82a50c7  ;*aload_0
                      │  │││                                                ; - org.openjdk.LoopInc::directInc@11 (line 19)
      0.80%    0.91%  ↘  │││  0x00007fa0a82a5116: mov    0xf0(%r11),%r10d   ;*invokevirtual getInt
                         │││                                                ; - java.util.concurrent.ThreadLocalRandom::current@9 (line 222)
      4.28%    1.23%     │││  0x00007fa0a82a511d: test   %r10d,%r10d
                        ╭│││  0x00007fa0a82a5120: je     0x00007fa0a82a517b  ;*ifne
                        ││││                                                ; - java.util.concurrent.ThreadLocalRandom::current@12 (line 222)
      2.11%    0.01%    ││││  0x00007fa0a82a5122: movabs $0x9e3779b97f4a7c15,%r10
      0.01%    0.07%    ││││  0x00007fa0a82a512c: add    0xe8(%r11),%r10    ;*ladd
                        ││││                                                ; - java.util.concurrent.ThreadLocalRandom::nextSeed@24 (line 242)
      7.73%    1.89%    ││││  0x00007fa0a82a5133: mov    %r10,%r9
      1.21%    1.84%    ││││  0x00007fa0a82a5136: shr    $0x21,%r9
      1.90%    0.03%    ││││  0x00007fa0a82a513a: xor    %r10,%r9
      2.02%    0.03%    ││││  0x00007fa0a82a513d: movabs $0xff51afd7ed558ccd,%rcx
      0.94%    1.82%    ││││  0x00007fa0a82a5147: imul   %rcx,%r9           ;*lmul
                        ││││                                                ; - java.util.concurrent.ThreadLocalRandom::mix32@9 (line 182)
      7.01%    2.40%    ││││  0x00007fa0a82a514b: mov    %r9,%rcx
                        ││││  0x00007fa0a82a514e: shr    $0x21,%rcx
      1.89%    0.70%    ││││  0x00007fa0a82a5152: xor    %r9,%rcx
      3.11%    2.55%    ││││  0x00007fa0a82a5155: movabs $0xc4ceb9fe1a85ec53,%r9
      0.99%    1.50%    ││││  0x00007fa0a82a515f: imul   %r9,%rcx
      7.66%    2.89%    ││││  0x00007fa0a82a5163: shr    $0x20,%rcx
      3.70%    1.97%    ││││  0x00007fa0a82a5167: mov    %ecx,%r9d
               0.11%    ││││  0x00007fa0a82a516a: and    $0x1,%r9d          ;*iand
                        ││││                                                ; - java.util.concurrent.ThreadLocalRandom::nextInt@34 (line 356)
      3.76%   11.13%    ││││  0x00007fa0a82a516e: cmp    $0x1,%r9d
                        │╰││  0x00007fa0a82a5172: je     0x00007fa0a82a5101
     10.48%   16.62%    │ ││  0x00007fa0a82a5174: test   %r9d,%r9d
                        │ ╰│  0x00007fa0a82a5177: je     0x00007fa0a82a5103  ;*lookupswitch
                        │  │                                                ; - org.openjdk.LoopInc::directInc@15 (line 19)
                        │  ╰  0x00007fa0a82a5179: jmp    0x00007fa0a82a5103  ;*aload_0
                        │                                                   ; - org.openjdk.LoopInc::directInc@11 (line 19)
                        ↘     0x00007fa0a82a517b: mov    $0xffffff5d,%esi
    

    indirectInc:

      0.01%    0.01%  ↗  0x00007f65588d8260: mov    %edx,%r9d
      0.01%           │  0x00007f65588d8263: nopw   0x0(%rax,%rax,1)
     11.99%   11.38%  │  0x00007f65588d826c: data16 data16 xchg %ax,%ax  ;*iconst_0
                      │                                                ; - org.openjdk.LoopInc::indirectInc@11 (line 34)
                      │  0x00007f65588d8270: mov    0xf0(%r8),%r10d    ;*invokevirtual getInt
                      │                                                ; - java.util.concurrent.ThreadLocalRandom::current@9 (line 222)
                      │  0x00007f65588d8277: test   %r10d,%r10d
                      │  0x00007f65588d827a: je     0x00007f65588d8331  ;*ifne
                      │                                                ; - java.util.concurrent.ThreadLocalRandom::current@12 (line 222)
      0.01%           │  0x00007f65588d8280: movabs $0x9e3779b97f4a7c15,%r10
     11.80%   11.49%  │  0x00007f65588d828a: add    0xe8(%r8),%r10     ;*ladd
                      │                                                ; - java.util.concurrent.ThreadLocalRandom::nextSeed@24 (line 242)
      0.01%    0.01%  │  0x00007f65588d8291: mov    %r10,0xe8(%r8)     ;*invokevirtual putLong
                      │                                                ; - java.util.concurrent.ThreadLocalRandom::nextSeed@27 (line 241)
                      │  0x00007f65588d8298: mov    %r9d,%edx
      0.01%    0.01%  │  0x00007f65588d829b: inc    %edx
     11.12%   12.40%  │  0x00007f65588d829d: mov    %r10,%rcx
               0.01%  │  0x00007f65588d82a0: shr    $0x21,%rcx
      0.03%           │  0x00007f65588d82a4: xor    %r10,%rcx
      0.06%    0.03%  │  0x00007f65588d82a7: movabs $0xff51afd7ed558ccd,%r10
     12.38%   13.94%  │  0x00007f65588d82b1: imul   %r10,%rcx          ;*lmul
                      │                                                ; - java.util.concurrent.ThreadLocalRandom::mix32@9 (line 182)
      0.03%    0.01%  │  0x00007f65588d82b5: mov    %rcx,%r10
                      │  0x00007f65588d82b8: shr    $0x21,%r10
               0.03%  │  0x00007f65588d82bc: xor    %rcx,%r10
     11.43%   12.62%  │  0x00007f65588d82bf: movabs $0xc4ceb9fe1a85ec53,%rcx
               0.01%  │  0x00007f65588d82c9: imul   %rcx,%r10
      0.34%    0.30%  │  0x00007f65588d82cd: shr    $0x20,%r10
      0.85%    0.76%  │  0x00007f65588d82d1: mov    %r10d,%r10d
     11.81%   11.51%  │  0x00007f65588d82d4: and    $0x1,%r10d
      2.16%    1.78%  │  0x00007f65588d82d8: cmp    $0x1,%r10d
      3.45%    3.00%  │  0x00007f65588d82dc: cmovne %r9d,%edx         <----- HERE IT IS
     17.55%   15.86%  │  0x00007f65588d82e0: inc    %r11d              ;*iinc
                      │                                                ; - org.openjdk.LoopInc::indirectInc@56 (line 33)
                      │  0x00007f65588d82e3: cmp    $0x3e8,%r11d
                      ╰  0x00007f65588d82ea: jl     0x00007f65588d8260  ;*if_icmpge
                                                               ; - org.openjdk.LoopInc::indirectInc@8 (line 33)
    

    Notice the cmovne instead of a conditional jump -- this is why the branches are so much more "predictable": there is no branch on the random value left to predict. HotSpot profiles branches, and emits a conditional move when the branch profile is very flat, i.e. close to 50/50. In other words, it dodges a very likely branch misprediction by paying a bit of added latency for the conditional move. However, in this case the switch is special: it has more than two alternatives (0, 1, and "nothing"). This is why, I speculate, the result increment is not folded into a cmov in directInc. (Generally speaking, HotSpot could have stored zero to result in "default" and collapsed the switch to two alternatives, but it blew it, oh well.)
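
    If we wanted to dodge the problem at the source level instead, we could make the increment data-dependent rather than control-dependent, so there is no branch on the random value at all. A hypothetical variant, not part of the OP's benchmark, assuming getValue() yields only 0 or 1:

    @Benchmark
    public int dataDependentInc() {
        int result = 0;
        for (int i = 0; i < 1000; i++) {
            result += getValue();      // add the 0-or-1 outcome directly: nothing to predict
        }
        return result;
    }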

  5. To confirm that hypothesis, let's make a directCompleteInc case, where we still use a switch, but now cover all the cases:

    @Benchmark
    public int directCompleteInc() {
        int result = 0;
        for (int i = 0; i < 1000; i++) {
            switch (getValue()) {
                case 1:
                    result++;
                    break;
                default:
                    break;
            }
        }
        return result;
    }
    

    ...and measure it, this time without any extra options, just like the OP did:

    Benchmark                   Mode  Cnt    Score    Error   Units
    LoopInc.directCompleteInc  thrpt    5  644.414 ±  0.371  ops/ms
    LoopInc.directInc          thrpt    5  174.974 ±  0.103  ops/ms
    LoopInc.indirectInc        thrpt    5  644.015 ±  0.533  ops/ms
    

    THERE.

  6. Confirm directCompleteInc is using cmov with -prof perfasm. It does.

  7. Drink up.