Is there ever a reason not to use Java 8's parallelSort?

I was reading this question about the differences between Java's Arrays.sort and Arrays.parallelSort, which is a few years old by now. What surprised me is that only one answer mentioned any downside to using parallelSort: namely, that the speedup decreases if you're already using a lot of your CPU.

Assuming you're not in some sort of specialized, single-threaded environment, should one always choose parallelSort? Is there ever a reason not to? Note that one of the answers to the question above mentions that if there are fewer than 4096 elements, parallelSort simply calls sort anyway.


Solution 1:

There are some downsides to using Arrays.parallelSort:

  • it uses ForkJoinPool.commonPool() and will, by default, compete with everything else that uses that pool (e.g. parallel() on a stream)
  • the thread pool Arrays.parallelSort uses is not configurable per call (only globally, by raising the common pool's thread count)
  • it performs worse on small data sets (and more often than not, arrays contain few elements; the JDK even acknowledges that most ArrayLists stay empty for their whole lifetime, which saves quite a bit of memory and CPU time by not instantiating backing arrays that will never be filled)
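
The first two points can be checked directly. The parallelism value and the global system property below are real JDK knobs; the class name is just for illustration:

```java
import java.util.concurrent.ForkJoinPool;

public class CommonPoolDemo {

    public static void main(String[] args) {
        // Arrays.parallelSort and parallel streams both draw their workers
        // from this single shared pool; there is no per-call override.
        System.out.println("Common pool parallelism: "
                + ForkJoinPool.commonPool().getParallelism());

        // The only knob is global and must be set before the pool is
        // first touched, e.g. on the command line:
        // -Djava.util.concurrent.ForkJoinPool.common.parallelism=16
    }
}
```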

And another, anecdotal scenario: say you implement a card game that needs sorting. It's embarrassingly easy to parallelize multiple game executions next to each other instead of parallelizing the sorting mechanism of a single run, which may take only a fraction of the whole game loop. If the sort grabs the shared pool, you've lost that easy way to parallelize (e.g. when running the game in the context of genetic algorithms).
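
A sketch of that outer-level parallelism, with playOneGame as a hypothetical stand-in for one game execution:

```java
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.IntStream;

public class GameRuns {

    // Hypothetical stand-in for a single game execution. Its inner sort
    // stays sequential on purpose, so the parallelism budget is spent
    // at the outer level instead.
    static int playOneGame(int run) {
        int[] deck = ThreadLocalRandom.current().ints(52).toArray();
        Arrays.sort(deck);   // plain sort: one run doesn't fight for the pool
        return deck[0];      // dummy "game result": the lowest card
    }

    public static void main(String[] args) {
        // Parallelize across independent runs (e.g. the population of a
        // genetic algorithm), not inside one run's sort.
        int[] results = IntStream.range(0, 1_000)
                .parallel()
                .map(GameRuns::playOneGame)
                .toArray();
        System.out.println(results.length); // prints 1000
    }
}
```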

But yes: if you happen to have large arrays and sorting is a substantial part of your application's runtime, use Arrays.parallelSort.

EDIT: And even though Arrays.parallelSort switches to a normal sort if the given array has fewer than 4096 elements, it's all about expressing intent: you want a parallel sort if possible, which carries a different meaning than just calling sort. And to be nitpicky: it does indeed perform worse on small arrays, as it has to perform the additional check of whether the array contains fewer than 4096 elements, plus some other checks about the common pool's thread count (whose overhead is, of course, negligible) :).

Solution 2:

This is not much different from the question of when to use stream() vs parallelStream(): it depends on how much data you have. When sorting 10 elements in parallel, the majority of the time will of course be consumed by the threading framework under the hood (which is unspecified by the documentation), not by the sorting itself.
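
The same size dependence is easy to see with streams themselves; a minimal sketch (the class name is illustrative):

```java
import java.util.stream.IntStream;

public class StreamAnalogy {

    public static void main(String[] args) {
        // For tiny inputs the fork/join plumbing dominates; for large
        // inputs the actual work does. Same trade-off as sort vs
        // parallelSort.
        int small = IntStream.range(0, 10).parallel().sum();
        long big = IntStream.range(0, 10_000_000).parallel()
                .mapToLong(i -> i).sum();
        System.out.println(small); // prints 45
        System.out.println(big);
    }
}
```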

But you also have to wonder why such methods were introduced in the first place, IMO. Hardware is moving (has already moved?) towards more CPU cores, not more GHz, so doing things in parallel is simply the normal course for any language that wants to still be alive in the next 20 years.

As to how much data you need before parallelSort actually outperforms sort: we know we need at least MIN_ARRAY_SORT_GRAN + 1 elements to gain any potential benefit, and writing a proper test to prove that, for this particular setup and run, you need at least X elements is not that complicated. You also have to take into account that some arrays might already be sorted (explained further below), while others might be totally unsorted (5, 4, 3, 2, 1, for example), which brings some penalties for the latter case.
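
Generating those input shapes is straightforward. The single-shot System.nanoTime timing below is only illustrative (a harness like JMH is needed for trustworthy numbers), and the class name is made up:

```java
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;

public class InputShapes {

    public static void main(String[] args) {
        int n = 100_000;

        int[] sorted = new int[n];
        for (int i = 0; i < n; i++) sorted[i] = i;        // already sorted

        int[] reversed = new int[n];
        for (int i = 0; i < n; i++) reversed[i] = n - i;  // totally unsorted

        int[] random = ThreadLocalRandom.current().ints(n).toArray();

        // Single-shot timing is noisy and unreliable; it only shows that
        // the input shape is a variable worth controlling in a benchmark.
        for (int[] shape : new int[][] { sorted, reversed, random }) {
            int[] copy = shape.clone();
            long start = System.nanoTime();
            Arrays.parallelSort(copy);
            System.out.printf("%,d ns%n", System.nanoTime() - start);
        }
    }
}
```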

Taking some random data and making a test:

import java.util.Arrays;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

@Warmup(iterations = 10)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Measurement(iterations = 2, time = 2, timeUnit = TimeUnit.SECONDS)
public class ParallelSort {

    public static void main(String[] args) throws Exception {
        Options opt = new OptionsBuilder()
            .include(ParallelSort.class.getName())
            .build();

        new Runner(opt).run();
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @Fork(1)
    public int[] parallel(ParallelSortExecutionPlan plan) {
        Arrays.parallelSort(plan.ints());
        return plan.ints();
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @Fork(1)
    public int[] nonParallel(ParallelSortExecutionPlan plan) {
        Arrays.sort(plan.ints());
        return plan.ints();
    }
}


import java.util.concurrent.ThreadLocalRandom;

import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
public class ParallelSortExecutionPlan {

    @Param(value = {"10", "100", "1000", "10000", "100000", "1000000"})
    private int howMany;

    private int[] ints;

    @Setup(Level.Invocation)
    public void setUp() {
        ints = new int[howMany];
        for (int i = 0; i < howMany; ++i) {
            ints[i] = ThreadLocalRandom.current().nextInt();
        }
    }

    int[] ints() {
        return ints;
    }
}

Just notice that the second class uses @Setup(Level.Invocation) (a very sharp tool, if you know a bit of JMH); I use it here because I want an unsorted array for each invocation of the method. Otherwise, if Level.Trial had been used, for example, only the first call would see an unsorted array; every other call of the @Benchmark method would operate on an already-sorted one. For the fun of it, you could change that single line to @Setup(Level.Trial) and look at the results: they will make very little sense.

Running this reveals:

Benchmark                 (howMany)  Mode  Cnt         Score   Error  Units

ParallelSort.nonParallel         10  avgt    2       128.847          ns/op
ParallelSort.parallel            10  avgt    2       116.656          ns/op

ParallelSort.nonParallel        100  avgt    2      1956.746          ns/op
ParallelSort.parallel           100  avgt    2      1963.335          ns/op

ParallelSort.nonParallel       1000  avgt    2     32162.611          ns/op
ParallelSort.parallel          1000  avgt    2     31716.915          ns/op

ParallelSort.nonParallel      10000  avgt    2    423531.663          ns/op
ParallelSort.parallel         10000  avgt    2    201802.609          ns/op

ParallelSort.nonParallel     100000  avgt    2   6503511.987          ns/op
ParallelSort.parallel        100000  avgt    2   1363169.661          ns/op

ParallelSort.nonParallel    1000000  avgt    2  69058738.586          ns/op
ParallelSort.parallel       1000000  avgt    2  13469112.930          ns/op

Pretty much the output I would have expected.

Solution 3:

No: for small enough arrays, the overhead of distributing the work across threads won't result in an observable speed-up.

The key is "small enough". It won't be the same answer for all problems.

Dogma should never be applied, except in the case of this dogma rule. Just like the only thing we should never tolerate is intolerance. There's a Popper paradox in there somewhere.