When should streams be preferred over traditional loops for best performance? Do streams take advantage of branch-prediction?

I agree to the point that programming with streams is nice and easier for some scenarios but when we're losing out on performance, why do we need to use them?

Performance is rarely an issue. It would be usual for 10% of your streams would need to be rewritten as loops to get the performance you need.

Is there something I'm missing out on?

Using parallelStream() is much easier using streams and possibly more efficient as it's hard to write efficient concurrent code.

Which is the scenario in which streams perform equal to loops? Is it only in the case where your function defined takes a lot of time, resulting in a negligible loop performance?

Your benchmark is flawed in the sense that the code hasn't been compiled when it starts. I would do the whole test in a loop as JMH does, or I would use JMH.

In none of the scenario's I could see streams taking advantage of branch-prediction

Branch prediction is a CPU feature not a JVM or streams feature.

Java is a high level language saving the programmer from considering low level performance optimization.

Never choose a certain approach for performance reasons unless you have proven that this is a problem in your real application.

Your measurements show some negative effect for streams, but the difference is below observability. Therefore, it's not a Problem. Also, this Test is a "synthetic" situation and the code may behave completely different in a heavy duty production environment. Furthermore, the machine code created from your Java (byte) code by the JIT may change in future Java (maintenance) releases and make your measurements obsolete.

In conclusion: Choose the syntax or approach that most expresses your (the programmer's) intention. Keep to that same approach or syntax throughout the program unless you have a good reason to change.

Everything is said, but I want to show you how your code should look like using JMH.

@Fork(3)
@BenchmarkMode(Mode.AverageTime)
@Measurement(iterations = 10, timeUnit = TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
@Threads(1)
@Warmup(iterations = 5, timeUnit = TimeUnit.NANOSECONDS)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class MyBenchmark {

  private final int totalSize = 32_768;
  private final int filterValue = 1_280;
  private final int loopCount = 10_000;
  // private Random rnd;

  private int[] array;

  @Setup
  public void setup() {
    array = IntStream.range(0, totalSize).toArray();

    // rnd = new Random(0);
    // array = rnd.ints(totalSize).map(i -> i % 2560).toArray();
  }

  @Benchmark
  public long conditionalOperatorTime() {
    long sum = 0;
    for (int j = 0; j < loopCount; j++) {
      for (int c = 0; c < totalSize; ++c) {
        sum += array[c] >= filterValue ? array[c] : 0;
      }
    }
    return sum;
  }

  @Benchmark
  public long branchStatementTime() {
    long sum = 0;
    for (int j = 0; j < loopCount; j++) {
      for (int c = 0; c < totalSize; ++c) {
        if (array[c] >= filterValue) {
          sum += array[c];
        }
      }
    }
    return sum;
  }

  @Benchmark
  public long streamsTime() {
    long sum = 0;
    for (int j = 0; j < loopCount; j++) {
      sum += IntStream.of(array).filter(value -> value >= filterValue).sum();
    }
    return sum;
  }

  @Benchmark
  public long parallelStreamsTime() {
    long sum = 0;
    for (int j = 0; j < loopCount; j++) {
      sum += IntStream.of(array).parallel().filter(value -> value >= filterValue).sum();
    }
    return sum;
  }
}

The results for a sorted array:

Benchmark                            Mode  Cnt           Score           Error  Units
MyBenchmark.branchStatementTime      avgt   30   119833793,881 ±   1345228,723  ns/op
MyBenchmark.conditionalOperatorTime  avgt   30   118146194,368 ±   1748693,962  ns/op
MyBenchmark.parallelStreamsTime      avgt   30   499436897,422 ±   7344346,333  ns/op
MyBenchmark.streamsTime              avgt   30  1126768177,407 ± 198712604,716  ns/op

Results for unsorted data:

Benchmark                            Mode  Cnt           Score           Error  Units
MyBenchmark.branchStatementTime      avgt   30   534932594,083 ±   3622551,550  ns/op
MyBenchmark.conditionalOperatorTime  avgt   30   530641033,317 ±   8849037,036  ns/op
MyBenchmark.parallelStreamsTime      avgt   30   489184423,406 ±   5716369,132  ns/op
MyBenchmark.streamsTime              avgt   30  1232020250,900 ± 185772971,366  ns/op

I only can say that there are many possibilities of JVM optimizations and maybe branch-prediction is also involved. Now it is up to you to interpret the benchmark results.

I will add my 0.02$ here.

I just read about Branch-Prediction and wanted to try how this works with Java 8 Streams

Branch Prediction is a CPU feature, it has nothing to do with JVM. It is needed to keep CPU pipeline full and ready to do something. Measuring or predicting the branch prediction is extremely hard (unless you actually know the EXACT things that the CPU will do). This will depend on at least the load that the CPU is having right now (that might be a lot more than your program only).

However the performance with Streams is always turning out to be worse than traditional loops

This statement and the previous one are un-related. Yes, streams will be slower for simple examples like yours, up to 30% slower, which is OK. You could measure for a particular case how slower they are or faster via JMH as others have suggested, but that proves only that case, only that load.

At the same time you might be working with Spring/Hibernate/Services, etc etc that do things in milliseconds and your streams in nano-seconds and you worry about the performance? You are questioning the speed of your fastest part of the code? That's of course a theoretical thing.

And about your last point that you tried with sorted and un-sorted arrays and it gives you bad results. This is absolutely no indication of branch prediction or not - you have no idea at which point the prediction happened and if it did unless you can look inside the actual CPU pipelines - which you did not.

When should streams be preferred over traditional loops for best performance? Do streams take advantage of branch-prediction?

Related

Recent Posts