Fastest way to sort 10 numbers? (numbers are 32 bit)

Solution 1:

(Following up on the suggestion of HelloWorld to look into sorting networks.)

It seems that a 29-comparison/swap network is the fastest way to do a 10-input sort. I used the network discovered by Waksman in 1969 for this example in JavaScript, which should translate directly into C, as it's just a list of if statements, comparisons and swaps.

function sortNet10(data) {    // ten-input sorting network by Waksman, 1969
    var swap;
    if (data[0] > data[5]) { swap = data[0]; data[0] = data[5]; data[5] = swap; }
    if (data[1] > data[6]) { swap = data[1]; data[1] = data[6]; data[6] = swap; }
    if (data[2] > data[7]) { swap = data[2]; data[2] = data[7]; data[7] = swap; }
    if (data[3] > data[8]) { swap = data[3]; data[3] = data[8]; data[8] = swap; }
    if (data[4] > data[9]) { swap = data[4]; data[4] = data[9]; data[9] = swap; }
    if (data[0] > data[3]) { swap = data[0]; data[0] = data[3]; data[3] = swap; }
    if (data[5] > data[8]) { swap = data[5]; data[5] = data[8]; data[8] = swap; }
    if (data[1] > data[4]) { swap = data[1]; data[1] = data[4]; data[4] = swap; }
    if (data[6] > data[9]) { swap = data[6]; data[6] = data[9]; data[9] = swap; }
    if (data[0] > data[2]) { swap = data[0]; data[0] = data[2]; data[2] = swap; }
    if (data[3] > data[6]) { swap = data[3]; data[3] = data[6]; data[6] = swap; }
    if (data[7] > data[9]) { swap = data[7]; data[7] = data[9]; data[9] = swap; }
    if (data[0] > data[1]) { swap = data[0]; data[0] = data[1]; data[1] = swap; }
    if (data[2] > data[4]) { swap = data[2]; data[2] = data[4]; data[4] = swap; }
    if (data[5] > data[7]) { swap = data[5]; data[5] = data[7]; data[7] = swap; }
    if (data[8] > data[9]) { swap = data[8]; data[8] = data[9]; data[9] = swap; }
    if (data[1] > data[2]) { swap = data[1]; data[1] = data[2]; data[2] = swap; }
    if (data[3] > data[5]) { swap = data[3]; data[3] = data[5]; data[5] = swap; }
    if (data[4] > data[6]) { swap = data[4]; data[4] = data[6]; data[6] = swap; }
    if (data[7] > data[8]) { swap = data[7]; data[7] = data[8]; data[8] = swap; }
    if (data[1] > data[3]) { swap = data[1]; data[1] = data[3]; data[3] = swap; }
    if (data[4] > data[7]) { swap = data[4]; data[4] = data[7]; data[7] = swap; }
    if (data[2] > data[5]) { swap = data[2]; data[2] = data[5]; data[5] = swap; }
    if (data[6] > data[8]) { swap = data[6]; data[6] = data[8]; data[8] = swap; }
    if (data[2] > data[3]) { swap = data[2]; data[2] = data[3]; data[3] = swap; }
    if (data[4] > data[5]) { swap = data[4]; data[4] = data[5]; data[5] = swap; }
    if (data[6] > data[7]) { swap = data[6]; data[6] = data[7]; data[7] = swap; }
    if (data[3] > data[4]) { swap = data[3]; data[3] = data[4]; data[4] = swap; }
    if (data[5] > data[6]) { swap = data[5]; data[5] = data[6]; data[6] = swap; }
    return(data);
}

alert(sortNet10([5,7,1,8,4,3,6,9,2,0]));

Here's a graphical representation of the network, divided into independent phases.

10-input sorting network (Waksman, 1969)

To take advantage of parallel processing, the 5-4-3-4-4-4-3-2 grouping can be changed into a 4-4-4-4-4-4-3-2 grouping.

10-input sorting network (Waksman, 1969) re-grouped

Solution 2:

When you deal with this fixed size, take a look at sorting networks. These algorithms have a fixed runtime and are independent of their input. For your use case, you don't have such overhead that some sorting algorithms have.

Bitonic sort is an implementation of such a network. This one works best with len(n) <= 32 on a CPU. On bigger inputs you could think of moving to a GPU.

By the way, a good page to compare sorting algorithms is this one here (though it's missing the bitonic sort):

Sorting Algorithms Animations

Solution 3:

Use a sorting network that has comparisons in groups of 4, so you can do it in SIMD registers. A pair of packed min/max instructions implements a packed comparator function. Sorry I don't have time right now to look for a page I remember seeing about this, but hopefully searching on SIMD or SSE sorting networks will turn something up.

x86 SSE does have packed-32bit-integer min and max instructions for vectors of four 32bit ints. AVX2 (Haswell and later) have the same but for 256b vectors of 8 ints. There are also efficient shuffle instructions.

If you have a lot of independent small sorts, it might be possible to do 4 or 8 sorts in parallel using vectors. Esp. if you're choosing elements randomly (so the data to be sorted won't be contiguous in memory anyway), you can avoid shuffles and simply compare in the order you need. 10 registers to hold all the data from 4 (AVX2: 8) lists of 10 ints still leaves 6 regs for scratch space.

Vector sorting networks are less efficient if you also need to sort associated data. In that case, the most efficient way seems to be to use a packed-compare to get a mask of which elements changed, and use that mask to blend vectors of (references to) associated data.