Parallel stream from a HashSet doesn't run in parallel
I have collection of elements that I want to process in parallel. When I use a List
, parallelism works. However, when I use a Set
, it does not run in parallel.
I wrote a code sample that shows the problem:
public static void main(String[] args) {
ParallelTest test = new ParallelTest();
List<Integer> list = Arrays.asList(1,2);
Set<Integer> set = new HashSet<>(list);
ForkJoinPool forkJoinPool = new ForkJoinPool(4);
System.out.println("set print");
try {
forkJoinPool.submit(() ->
set.parallelStream().forEach(test::print)
).get();
} catch (Exception e) {
return;
}
System.out.println("\n\nlist print");
try {
forkJoinPool.submit(() ->
list.parallelStream().forEach(test::print)
).get();
} catch (Exception e) {
return;
}
}
private void print(int i){
System.out.println("start: " + i);
try {
TimeUnit.SECONDS.sleep(1);
} catch (InterruptedException e) {
}
System.out.println("end: " + i);
}
This is the output that I get on windows 7
set print
start: 1
end: 1
start: 2
end: 2
list print
start: 2
start: 1
end: 1
end: 2
We can see that the first element from the Set
had to finish before the second element is processed. For the List
, the second element starts before the first element finishes.
Can you tell me what causes this issue, and how to avoid it using a Set
collection?
Solution 1:
I can reproduce the behavior you see, where the parallelism doesn't match the parallelism of the fork-join pool parallelism you've specified. After setting the fork-join pool parallelism to 10, and increasing the number of elements in the collection to 50, I see the parallelism of the list-based stream rising only to 6, whereas the parallelism of the set-based stream never gets above 2.
Note, however, that this technique of submitting a task to a fork-join pool to run the parallel stream in that pool is an implementation "trick" and is not guaranteed to work. Indeed, the threads or thread pool that is used for execution of parallel streams is unspecified. By default, the common fork-join pool is used, but in different environments, different thread pools might end up being used. (Consider a container within an application server.)
In the java.util.stream.AbstractTask class, the LEAF_TARGET
field determines the amount of splitting that is done, which in turn determines the amount of parallelism that can be achieved. The value of this field is based on ForkJoinPool.getCommonPoolParallelism()
which of course uses the parallelism of the common pool, not whatever pool happens to be running the tasks.
Arguably this is a bug (see OpenJDK issue JDK-8190974), however, this entire area is unspecified anyway. However, this area of the system definitely needs development, for example in terms of splitting policy, the amount of parallelism available, dealing with blocking tasks, among other issues. A future release of the JDK may address some of these issues.
Meanwhile, it is possible to control the parallelism of the common fork-join pool through the use of system properties. If you add this line to your program,
System.setProperty("java.util.concurrent.ForkJoinPool.common.parallelism", "10");
and you run the streams in the common pool (or if you submit them to your own pool that has a sufficiently high level of parallelism set) you will observe that many more tasks are run in parallel.
You can also set this property on the command line using the -D
option.
Again, this is not guaranteed behavior, and it may change in the future. But this technique will probably work for JDK 8 implementations for the forseeable future.
UPDATE 2019-06-12: The bug JDK-8190974 was fixed in JDK 10, and the fix has been backported to an upcoming JDK 8u release (8u222).