Why does running multiple lambdas in loops suddenly slow down?
You can actually replicate this with JMH SingleShot mode:
import org.openjdk.jmh.annotations.*;

// Single-shot, no warmup: measures one cold invocation, JIT compilation included.
@BenchmarkMode(Mode.SingleShotTime)
@Warmup(iterations = 0)
@Measurement(iterations = 1)
@Fork(1)
public class Lambdas {

    @Benchmark
    public static void doOne() {
        execute(() -> {});
    }

    @Benchmark
    public static void doFour() {
        // Four distinct lambda expressions, all funneled through the same loop.
        execute(() -> {});
        execute(() -> {});
        execute(() -> {});
        execute(() -> {});
    }

    public static void execute(Runnable task) {
        for (int i = 0; i < 100_000_000; i++) {
            task.run();
        }
    }
}
Benchmark       Mode  Cnt  Score  Error  Units
Lambdas.doFour    ss       0.446          s/op
Lambdas.doOne     ss       0.006          s/op
If you look at the -prof perfasm profile for the doFour test, you get a fat clue:
....[Hottest Methods (after inlining)]..............................................................
32.19% c2, level 4 org.openjdk.Lambdas$$Lambda$44.0x0000000800c258b8::run, version 664
26.16% c2, level 4 org.openjdk.Lambdas$$Lambda$43.0x0000000800c25698::run, version 658
There are at least two hot lambdas, and they are represented by different classes. So what you are seeing is likely a monomorphic (one target), then bimorphic (two targets), then polymorphic virtual call at task.run().
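To see that separate lambda expressions really do end up as separate classes, a small check like the one below can help. It is only a sketch: the class name LambdaClassesSketch and the helpers first() and second() are made up for illustration, and the one-class-per-lambda-expression behavior is what current HotSpot/OpenJDK builds do in practice rather than something the language specification promises.

```java
final class LambdaClassesSketch {
    // Two textually identical but separate lambda expressions (hypothetical helpers).
    static Runnable first()  { return () -> {}; }
    static Runnable second() { return () -> {}; }

    public static void main(String[] args) {
        Runnable a = first();
        Runnable b = second();
        // The generated class names are JVM-specific, so they are printed, not predicted.
        System.out.println(a.getClass().getName());
        System.out.println(b.getClass().getName());
        // On HotSpot each lambda expression typically gets its own class,
        // so a single call site that receives both sees two receiver types.
        System.out.println("same class? " + (a.getClass() == b.getClass()));
    }
}
```

Passing both a and b to one shared method is therefore enough to turn its task.run() call site from monomorphic into bimorphic.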
A virtual call has to choose which class's implementation to call. The more classes you have, the worse it gets for the optimizer. The JVM tries to adapt, but it gets worse and worse as the run progresses. Roughly like this:
execute(() -> {}); // compiles with single target, fast
execute(() -> {}); // recompiles with two targets, a bit slower
execute(() -> {}); // recompiles with three targets, slow
execute(() -> {}); // continues to be slow
Now, eliminating the loop requires seeing through task.run(). In the monomorphic and bimorphic cases this is easy: one or both targets are inlined, their empty bodies are discovered, done. Both cases still need type checks, so they are not completely free, with the bimorphic case costing a bit extra. With a polymorphic call there is no such luck at all: it is an opaque call.
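As a rough mental model, the bimorphic case behaves as if the call site had been rewritten by hand into two class checks with the target bodies pasted in. The sketch below is plain Java written to mimic that shape, not actual compiler output; the classes A, B, C and the fallbackCalls counter are hypothetical and exist only to make the fallback path visible.

```java
final class GuardedInlineSketch {
    static final class A implements Runnable { @Override public void run() {} }
    static final class B implements Runnable { @Override public void run() {} }
    static final class C implements Runnable { @Override public void run() {} }

    // Counts how often the hand-written fallback (a real virtual call) is taken.
    static int fallbackCalls = 0;

    // Hand-written analogue of a bimorphic call site after inlining:
    // two cheap class checks, with the (empty) target bodies pasted in place.
    static void bimorphicCallSite(Runnable task) {
        Class<?> k = task.getClass();
        if (k == A.class) {
            // inlined body of A.run(): nothing to do
        } else if (k == B.class) {
            // inlined body of B.run(): nothing to do
        } else {
            // A receiver type the checks do not cover: the body is not visible here,
            // so an actual call has to be made.
            fallbackCalls++;
            task.run();
        }
    }

    public static void main(String[] args) {
        bimorphicCallSite(new A());
        bimorphicCallSite(new B());
        bimorphicCallSite(new C());
        System.out.println("fallback calls: " + fallbackCalls); // prints "fallback calls: 1"
    }
}
```

With only A and B flowing in, every branch is empty, so a loop around the call has nothing left to do and can be removed. Once a third class shows up, the real call in the last branch keeps the loop alive.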
You can add two more benchmarks to the mix to see it:
    @Benchmark
    public static void doFour_Same() {
        Runnable l = () -> {};
        execute(l);
        execute(l);
        execute(l);
        execute(l);
    }

    @Benchmark
    public static void doFour_Pair() {
        Runnable l1 = () -> {};
        Runnable l2 = () -> {};
        execute(l1);
        execute(l1);
        execute(l2);
        execute(l2);
    }
Which would then yield:
Benchmark            Mode  Cnt  Score  Error  Units
Lambdas.doFour         ss       0.445          s/op  ; polymorphic
Lambdas.doFour_Pair    ss       0.016          s/op  ; bimorphic
Lambdas.doFour_Same    ss       0.008          s/op  ; monomorphic
Lambdas.doOne          ss       0.006          s/op
This also explains why your "fixes" help:
- using 1-2 nested classes instead of lambdas:
  Bimorphic inlining.
- using 1-2 lambda instances instead of 4 different ones:
  Bimorphic inlining.
- not calling task.run() lambdas inside the loop:
  Avoids a polymorphic (opaque) call in the loop, which allows loop elimination.
- inlining the execute() method, still maintaining 4 different lambdas:
  Avoids a single call site that experiences multiple call targets. In other words, it turns a single polymorphic call site into a series of monomorphic call sites, each with its own target.
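The last fix can be pictured by writing the inlining out by hand. This is a minimal sketch, assuming the same empty lambdas as in the benchmark; the class name InlinedExecuteSketch, both method names, and the iteration counter are hypothetical additions so the result can be checked, and the counter is not part of the original benchmark.

```java
final class InlinedExecuteSketch {
    // Shared helper: the single task.run() call site below receives
    // every lambda class that any caller ever passes in.
    static int executeShared(Runnable task, int n) {
        int done = 0;
        for (int i = 0; i < n; i++) {
            task.run();
            done++;
        }
        return done;
    }

    // execute() inlined by hand: four loops, four separate call sites,
    // and each call site only ever sees one lambda class.
    static int doFourInlined(int n) {
        Runnable a = () -> {};
        Runnable b = () -> {};
        Runnable c = () -> {};
        Runnable d = () -> {};
        int done = 0;
        for (int i = 0; i < n; i++) { a.run(); done++; }
        for (int i = 0; i < n; i++) { b.run(); done++; }
        for (int i = 0; i < n; i++) { c.run(); done++; }
        for (int i = 0; i < n; i++) { d.run(); done++; }
        return done;
    }

    public static void main(String[] args) {
        System.out.println(doFourInlined(1_000));            // prints 4000
        System.out.println(executeShared(() -> {}, 1_000));  // prints 1000
    }
}
```

Both variants do the same work. The difference is only in how many receiver classes each individual call site gets to observe, which is what the type profile is collected against.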