Is Java 7 using Tim Sort for the Method Arrays.Sort?
Java 7 uses Dual-Pivot Quicksort for primitives and TimSort for objects.
According to the Java 7 API doc for primitives:
Implementation note: The sorting algorithm is a Dual-Pivot Quicksort by Vladimir Yaroslavskiy, Jon Bentley, and Joshua Bloch. This algorithm offers O(n log(n)) performance on many data sets that cause other quicksorts to degrade to quadratic performance, and is typically faster than traditional (one-pivot) Quicksort implementations.
According to the Java 7 API doc for objects:
The implementation was adapted from Tim Peters's list sort for Python ( TimSort). It uses techiques from Peter McIlroy's "Optimistic Sorting and Information Theoretic Complexity", in Proceedings of the Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, pp 467-474, January 1993.
Timsort is a hybrid "merge sort and insertion sort."
Not sure if this is much different from what it was in Java 6, for Arrays.sort JDK6:
a tuned quicksort, adapted from Jon L. Bentley and M. Douglas McIlroy's "Engineering a Sort Function", Software-Practice and Experience, Vol. 23(11) P. 1249-1265 (November 1993)
For Object[] or collections (Collections.sort()) merge sort is used.
Yes! ... and also no.
Summary
At the time of writing (2016), in current Open JDK 0 implementations Tim Sort is generally used for sorting arrays of objects (i.e., Arrays.sort(Object[])
and friends) - but for primitive arrays (the remainder of the Arrays.sort
methods) a variety of other methods are used.
For primitives, the heuristics choose among a variety of sorting methods such as quicksort, merge sort, counting sort3. depending on the data being sorted. Most of these decisions are simply made up-front based on the type and size of the array being sorted, but for int
and long
elements the decision is actually adaptive based on the measured sortedness of the array. So you have adaptation/introspection (the heuristics to pick an algorithm) on top of adaptation/introspection (TimSort or similar merge sort) in many cases!
Details
Tim Sort is used for most sorts of objects, such as Arrays.sort(Object[] a)
, unless the user has specifically requested the legacy behavior by setting system property java.util.Arrays.useLegacyMergeSort
to true.
For primitives, the situation is more complex. At least as of JDK 8 (version 1.8.0_111
) a variety of heurstics are used depending on the size of the arrays being sorted, the primitive type and the measured "sortedness" of the array. Here's an overview:
- For all primitive types other than bytes1, arrays of less than 47 elements are simply sorted using insertion sort (see
DualPivotQuicksort.INSERTION_SORT_THRESHOLD
). This threshold is also used when sorting sub-arrays that arise when merge or quicksort are used and the size of the subarray falls below the threshold. So some form of insertion sort will be used in all sorts, and for small arrays it is the only algorithm used. - For primitive types
byte
,short
andchar
, a counting sort is used for largish arrays. This is a simple sort that takesO(n + range)
time, whererange
is the total number of byte (256) or short/char (65536) values. The sort involves allocating an underlying array ofrange
values, so it is only used when the number of elements to sort is a significant fraction of the total range. In particular, it is used for byte arrays greater than 29 elements (i.e. ~11% of the range), and short/char arrays greater than 3200 elements (~5% of the range). - For byte arrays, one of the two approaches above is always used.
- For
int
andlong
arrays above the insertion sort threshold and forshort
/char
arrays both above the insertion sort threshold and below the counting sort threshold, one of two algorithms may be used: dual pivot quicksort, or merge sort. Which one is used depends on a measure of the sortedness of the array: the input is divided into runs of ascending or descending elements. If the number of such runs is greater than 66, then the array is considered mostly unsorted and is sorted with dual-pivot quicksort. Otherwise, the array is considered mostly sorted, and mergesort is used (using the already enumerated runs as a starting point).
The idea of finding runs and then using mergesort to sort them is in fact very similar to TimSort, although there are some differences. So at least for some parameters, the JDK is using a run-aware mergesort, but for many other combinations of parameters it is using a different algorithm, and at least 5 distinct algorithms are used in total!
Rationale
The reasoning behind the different sort behavior of Object[]
versus primitive is probably at least two-fold:
1) Sorts of Object[]
are required to be stable: objects which sort equally will appear in the same order as the input. For primitive arrays, no such concept exists: primitives are fully defined by their value, so there is no distinction between a stable and an unstable sort. This allows primitive sorts to dispense with the need for stable algorithms in favor of speed.
2) Sorts of Object[]
need to involve the Object.compare()
method, which may be arbitrarily complex and expensive. Even if the compare()
method is simple, there will generally be method call overhead unless the entire sort method can be inlined2. So sorts of Object[]
will generally be biased towards minimizing total comparisons, even at the cost of some additional algorithmic complexity.
Sorts of primitive arrays, on the other hand, just directly compare primitive values which typically takes on the order of a cycle or two. In this case, the algorithm should be optimized considering both the cost of comparisons and the surrounding algorithm, since the are likely to be of the same magnitude.
0 At least for versions between Java 7 and Java 9, and of course this also includes Oracle's JDK as it is based on Open JDK. It is likely that other implementations use a similar approach, but I haven't checked.
1 For byte arrays, the insertion sort threshold is effectively 29 elements since that's the lower cutoff above which counting sort is used.
2 This seems unlikely, since it is quite large.
3 Counting sort is only used for values with relatively limited range of 16-bits or less: byte
, short
or char
.
Yes, Java 7 will use Timsort for Arrays.sort. Here is the commit: http://hg.openjdk.java.net/jdk7/jdk7/jdk/rev/bfd7abda8f79