When should streams be preferred over traditional loops for best performance? Do streams take advantage of branch-prediction?
I agree to the point that programming with streams is nice and easier for some scenarios but when we're losing out on performance, why do we need to use them?
Performance is rarely an issue. It would be usual for 10% of your streams would need to be rewritten as loops to get the performance you need.
Is there something I'm missing out on?
Using parallelStream() is much easier using streams and possibly more efficient as it's hard to write efficient concurrent code.
Which is the scenario in which streams perform equal to loops? Is it only in the case where your function defined takes a lot of time, resulting in a negligible loop performance?
Your benchmark is flawed in the sense that the code hasn't been compiled when it starts. I would do the whole test in a loop as JMH does, or I would use JMH.
In none of the scenario's I could see streams taking advantage of branch-prediction
Branch prediction is a CPU feature not a JVM or streams feature.
Java is a high level language saving the programmer from considering low level performance optimization.
Never choose a certain approach for performance reasons unless you have proven that this is a problem in your real application.
Your measurements show some negative effect for streams, but the difference is below observability. Therefore, it's not a Problem. Also, this Test is a "synthetic" situation and the code may behave completely different in a heavy duty production environment. Furthermore, the machine code created from your Java (byte) code by the JIT may change in future Java (maintenance) releases and make your measurements obsolete.
In conclusion: Choose the syntax or approach that most expresses your (the programmer's) intention. Keep to that same approach or syntax throughout the program unless you have a good reason to change.
Everything is said, but I want to show you how your code should look like using JMH.
@Fork(3)
@BenchmarkMode(Mode.AverageTime)
@Measurement(iterations = 10, timeUnit = TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
@Threads(1)
@Warmup(iterations = 5, timeUnit = TimeUnit.NANOSECONDS)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class MyBenchmark {
private final int totalSize = 32_768;
private final int filterValue = 1_280;
private final int loopCount = 10_000;
// private Random rnd;
private int[] array;
@Setup
public void setup() {
array = IntStream.range(0, totalSize).toArray();
// rnd = new Random(0);
// array = rnd.ints(totalSize).map(i -> i % 2560).toArray();
}
@Benchmark
public long conditionalOperatorTime() {
long sum = 0;
for (int j = 0; j < loopCount; j++) {
for (int c = 0; c < totalSize; ++c) {
sum += array[c] >= filterValue ? array[c] : 0;
}
}
return sum;
}
@Benchmark
public long branchStatementTime() {
long sum = 0;
for (int j = 0; j < loopCount; j++) {
for (int c = 0; c < totalSize; ++c) {
if (array[c] >= filterValue) {
sum += array[c];
}
}
}
return sum;
}
@Benchmark
public long streamsTime() {
long sum = 0;
for (int j = 0; j < loopCount; j++) {
sum += IntStream.of(array).filter(value -> value >= filterValue).sum();
}
return sum;
}
@Benchmark
public long parallelStreamsTime() {
long sum = 0;
for (int j = 0; j < loopCount; j++) {
sum += IntStream.of(array).parallel().filter(value -> value >= filterValue).sum();
}
return sum;
}
}
The results for a sorted array:
Benchmark Mode Cnt Score Error Units
MyBenchmark.branchStatementTime avgt 30 119833793,881 ± 1345228,723 ns/op
MyBenchmark.conditionalOperatorTime avgt 30 118146194,368 ± 1748693,962 ns/op
MyBenchmark.parallelStreamsTime avgt 30 499436897,422 ± 7344346,333 ns/op
MyBenchmark.streamsTime avgt 30 1126768177,407 ± 198712604,716 ns/op
Results for unsorted data:
Benchmark Mode Cnt Score Error Units
MyBenchmark.branchStatementTime avgt 30 534932594,083 ± 3622551,550 ns/op
MyBenchmark.conditionalOperatorTime avgt 30 530641033,317 ± 8849037,036 ns/op
MyBenchmark.parallelStreamsTime avgt 30 489184423,406 ± 5716369,132 ns/op
MyBenchmark.streamsTime avgt 30 1232020250,900 ± 185772971,366 ns/op
I only can say that there are many possibilities of JVM optimizations and maybe branch-prediction is also involved. Now it is up to you to interpret the benchmark results.
I will add my 0.02$ here.
I just read about Branch-Prediction and wanted to try how this works with Java 8 Streams
Branch Prediction is a CPU feature, it has nothing to do with JVM. It is needed to keep CPU pipeline full and ready to do something. Measuring or predicting the branch prediction is extremely hard (unless you actually know the EXACT things that the CPU will do). This will depend on at least the load that the CPU is having right now (that might be a lot more than your program only).
However the performance with Streams is always turning out to be worse than traditional loops
This statement and the previous one are un-related. Yes, streams will be slower for simple examples like yours, up to 30% slower, which is OK. You could measure for a particular case how slower they are or faster via JMH as others have suggested, but that proves only that case, only that load.
At the same time you might be working with Spring/Hibernate/Services, etc etc that do things in milliseconds and your streams in nano-seconds and you worry about the performance? You are questioning the speed of your fastest part of the code? That's of course a theoretical thing.
And about your last point that you tried with sorted and un-sorted arrays and it gives you bad results. This is absolutely no indication of branch prediction or not - you have no idea at which point the prediction happened and if it did unless you can look inside the actual CPU pipelines - which you did not.