Which parallel sorting algorithm has the best average case performance?
Solution 1:
The following article (PDF download) is a comparative study of parallel sorting algorithms on various architectures:
Parallel sorting algorithms on various architectures
According to the article, sample sort seems to be best on many parallel architecture types.
Update to address Mark's concern of age:
Here are more recent articles introducing something more novel (from 2007, which, btw, still get compared with sample sort):
Improvements on sample sort
AA-Sort
The bleeding edge (circa 2010, some only a couple months old):
Parallel sorting pattern
Many-core GPU based parallel sorting
Hybrid CPU/GPU parallel sort
Randomized Parallel Sorting Algorithm with an Experimental Study
Highly scalable parallel sorting
Sorting N-Elements Using Natural Order: A New Adaptive Sorting Approach
Update for 2013:
Here is the bleeding edge circa January, 2013. (Note: A few of the links are to papers at Citeseer and require registration which is free):
University lectures:
Parallel Partitioning for Selection and Sorting
Parallel Sorting Algorithms Lecture
Parallel Sorting Algorithms Lecture 2
Parallel Sorting Algorithms Lecture 3
Other sources and papers:
A novel sorting algorithm for many-core architectures based on adaptive bitonic sort
Highly Scalable Parallel Sorting 2
Parallel Merging
Parallel Merging 2
Parallel Self-Sorting System for Objects
Performance Comparison of Sequential Quick Sort and Parallel Quick Sort Algorithms
Shared Memory, Message Passing, and Hybrid Merge Sorts for Standalone and Clustered SMPs
Various parallel algorithms (sorting et al) including implementations
GPU and CPU/GPU hybrid sources and papers:
An OpenCL Method of Parallel Sorting Algorithms for GPU Architecture
Data Sorting Using Graphics Processing Units
Efficient Algorithms for Sorting on GPUs
Designing efficient sorting algorithms for manycore GPUs
Deterministic Sample Sort For GPUs
Fast in-place sorting with CUDA based on bitonic sort
Fast parallel GPU-sorting using a hybrid algorithm
Fast Parallel Sorting Algorithms on GPUs
Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort
GPU sample sort
GPU-ABiSort: Optimal Parallel Sorting on Stream Architectures
GPUTeraSort: high performance graphics co-processor sorting for large database management
High performance comparison-based sorting algorithm on many-core GPUs
Parallel external sorting for CUDA-enabled GPUs with load balancing and low transfer overhead
Sorting on GPUs for large scale datasets: A thorough comparison
Update for 2021: I have not forgotten this answer and like all things computer related, it has not aged well. I will do my best to update and refresh it for current trends as well as the state of the art, at some point prior to the end of this year (2021).
Solution 2:
I have worked with both a Parallel Quicksort algorithm and a PSRS algorithm that essentially combines quicksort in parallel with merging.
With the Parallel Quicksort algorithm, I have demonstrated near linear speedup with up to 4 cores (dual core with hyper-threading), which is expected given the limitations of the algorithm. A pure Parallel Quicksort relies on a shared stack resource which will result in contention among threads, thus reducing any gain in performance. The advantage of this algorithm is that it sorts 'in-place,' which reduces the amount of memory needed. You may want to consider this when sorting upwards of 100M elements as you stated.
I see you are looking to sort on a system with 8-32 cores. The PSRS algorithm avoids contention at the shared resource, allowing speedup at higher numbers of processes. I have demonstrated the algorithm with up to 4 cores as above, but experimental results of others report near linear speedup with much larger numbers of core, 32 and beyond. The disadvantage of the PSRS algorithm is that it is not in-place and will require considerably more memory.
If you are interested, you may use or peruse my Java code for each of these algorithms. You can find it on github: https://github.com/broadbear/sort. The code is intended as a drop-in replacement of Java Collections.sort(). If you are looking for the ability to perform parallel sorting in a JVM as you state above, the code in my repo may help you out. The API is fully genericized for elements implementing Comparable or implementing your own Comparator.
May I ask what you are looking to sort that many elements for? I'm interested to know of potential applications for my sorting package.
Solution 3:
Take a look at this paper: A Scalable Parallel Sorting Algorithm Using Exact Splitting. It is concerned with many more than 32 cores. However, it describes in detail an algorithm, which has a running time complexity of O(n/p * log(n) + p * log(n)**2) and is applicable for arbitrary comparators.
Solution 4:
The paper "Comparison of Parallel Sorting Algorithms on Different Architectures" may be a good place for you to start.