Calculate the median of a billion numbers

Ah, my brain has just kicked into gear: I have a sensible suggestion now. Probably too late if this had been an interview, but never mind:

Machine 1 shall be called the "control machine"; for the sake of argument, either it starts with all the data and sends it in roughly equal parcels to the other 99 machines, or else the data starts evenly distributed between the machines and it sends 1/99 of its own share to each of the others. The partitions do not have to be exactly equal, just close.

Each of the other machines sorts its data, and does so in a way which favours finding the lower values first: for example, a quicksort that always sorts the lower part of the partition first[*]. It writes its data back to the control machine in increasing order as soon as it can (using asynchronous I/O so as to continue sorting, and probably with Nagle on: experiment a bit).
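A rough sketch of that sorting side, in Python for brevity (the function name is mine, and a real implementation would overlap the network writes with the sorting rather than yield values one at a time):

    import random

    def stream_in_order(values):
        """Quicksort variant that yields values in increasing order,
        finishing the lower partition first, so the smallest values can be
        sent back to the control machine before the rest are sorted."""
        if len(values) <= 64:                 # small base case: just sort it
            yield from sorted(values)
            return
        pivot = random.choice(values)
        lower = [v for v in values if v < pivot]
        equal = [v for v in values if v == pivot]
        upper = [v for v in values if v > pivot]
        yield from stream_in_order(lower)     # sorted and streamed out first
        yield from equal
        yield from stream_in_order(upper)     # only touched once the low half has gone

    # e.g. for value in stream_in_order(my_chunk): send_to_control(value)
    # (send_to_control and my_chunk are stand-ins for the real async write and data)

Note that this version takes the footnote's easy way out and uses O(N) extra space for the partitions.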

The control machine performs a 99-way merge on the data as it arrives, but discards the merged data, just keeping count of the number of values it has seen. It calculates the median as the mean of the 500,000,000th and 500,000,001st values.
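The counting merge on the control machine might look roughly like this (a sketch only: heapq.merge stands in for the hand-rolled 99-way merge, and the names are mine):

    import heapq

    def median_from_sorted_streams(streams, n_total=1_000_000_000):
        """Merge the incoming sorted streams, discarding values and only
        keeping count, until the two middle values have gone past.
        Assumes n_total is even, as it is here."""
        lower_rank = n_total // 2                   # 500,000,000
        lower_mid = None
        for rank, value in enumerate(heapq.merge(*streams), start=1):
            if rank == lower_rank:
                lower_mid = value
            elif rank == lower_rank + 1:
                return (lower_mid + value) / 2      # mean of the two middle values
        raise ValueError("fewer than n_total values received")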

This suffers from the "slowest in the herd" problem. The algorithm cannot complete until every value less than the median has been sent by a sorting machine. There's a reasonable chance that one such value will be quite high within its parcel of data. So once the initial partitioning of the data is complete, estimated running time is the combination of the time to sort 1/99th of the data and send it back to the control computer, and the time for the control to read 1/2 the data. The "combination" is somewhere between the maximum and the sum of those times, probably close to the max.

My instinct is that for sending data over a network to be faster than sorting it (let alone just selecting the median) it needs to be a pretty damn fast network. Might be a better prospect if the network can be presumed to be instantaneous, for example if you have 100 cores with equal access to RAM containing the data.

Since network I/O is likely to be the bottleneck, there might be some tricks you can play, at least for the data coming back to the control machine. For example, instead of sending "1, 2, 3, ..., 100", perhaps a sorting machine could send a message meaning "100 values, all less than 101". The control machine could then perform a modified merge, in which it finds the least of all those top-of-a-range values, then tells all the sorting machines what it was, so that they can (a) tell the control machine how many values to "count" below that value, and (b) resume sending their sorted data from that point.
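A sketch of the per-machine side of that exchange (Python again, names mine; the control-machine loop that picks the least bound, broadcasts it and accumulates the replies is left out):

    import bisect

    class Sorter:
        """Holds one machine's parcel, already sorted, and answers the two
        cheap queries the control machine needs instead of receiving the
        raw values."""
        def __init__(self, parcel):
            self.data = sorted(parcel)

        def count_below(self, bound):
            # (a) "how many of your values are less than bound" -- one small
            # message in place of that many values on the wire
            return bisect.bisect_left(self.data, bound)

        def values_from(self, bound, limit=100):
            # (b) resume sending sorted values from that point, a batch at a
            # time, once the control machine actually wants them
            start = bisect.bisect_left(self.data, bound)
            return self.data[start:start + limit]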

More generally, there's probably a clever challenge-response guessing game that the control machine can play with the 99 sorting machines.

This involves round-trips between the machines, though, which my simpler first version avoids. I don't really know how to blind-estimate their relative performance, and since the trade-offs are complex, I imagine there are much better solutions out there than anything I'll think of myself, assuming this is ever a real problem.

[*] available stack permitting - your choice of which part to do first is constrained if you don't have O(N) extra space. But if you do have enough extra space, you can take your pick, and if you don't have enough space you can at least use what you do have to cut some corners, by doing the small part first for the first few partitions.


sort -g numbers | head -n 500000001 | tail -n 2 | dc -e "1 k ? ? + 2 / p"

I hate to be the contrarian here, but I don't believe sorting is required, and I think any algorithm that involves sorting even 1/100th of a billion numbers (10 million per machine) is going to be slow. Let's consider an algorithm on one computer.

1) Select 1000 values at random from the billion, and use them to get an idea of the distribution of the numbers, especially a range.

2) Instead of sorting the values, allocate them to buckets based on the distribution you just estimated. The number of buckets is chosen so that the computer can handle them efficiently, but should otherwise be as large as convenient. The bucket ranges should be chosen so that approximately equal numbers of values go into each bucket (this isn't critical to the algorithm, but it helps efficiency; 100,000 buckets might be appropriate). Note the number of values in each bucket. This is an O(n) process.

3) Find out which bucket the median lies in. This can be done simply by examining the counts in each bucket and accumulating them until you pass the halfway point.

4) Find the actual median by examining the values in that bucket. You can use a sort here if you like, since you are only sorting maybe 10,000 numbers. If the number of values in that bucket is still large, you can apply this algorithm again to just that bucket until you have a small enough number to sort. (A rough sketch of steps 1-4 follows this list.)
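As a rough single-machine sketch of steps 1-4 (Python; the constants and names are placeholders, and it assumes an even count of values, as with a billion):

    import random

    def median_by_buckets(values, n_buckets=100_000, sample_size=1000):
        """Sample for a range, count into equal-width buckets in one O(n)
        pass, find the median's bucket from the counts, then sort only
        that bucket."""
        n = len(values)                                           # assumed even
        sample = random.sample(values, min(sample_size, n))       # step 1
        lo, hi = min(sample), max(sample)
        width = (hi - lo) / n_buckets or 1     # guard against a constant sample

        def bucket_of(v):                      # out-of-range values go in the end buckets
            return min(max(int((v - lo) // width), 0), n_buckets - 1)

        counts = [0] * n_buckets               # step 2: one counting pass
        for v in values:
            counts[bucket_of(v)] += 1

        def value_at_rank(k):                  # steps 3 and 4, k-th smallest (1-based)
            seen = 0
            for b, c in enumerate(counts):
                if seen + c >= k:
                    in_bucket = sorted(v for v in values if bucket_of(v) == b)
                    return in_bucket[k - seen - 1]
                seen += c

        return (value_at_rank(n // 2) + value_at_rank(n // 2 + 1)) / 2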

This approach parallelizes trivially by dividing the values between the computers. Each computer reports its bucket totals to a 'control' computer, which does step 3. For step 4, each computer sends the (sorted) values from the relevant bucket to the control computer. (You could parallelize those two steps themselves as well, but it probably isn't worth it.)
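The control computer's share of step 3 is then just adding up the reported histograms and walking the totals (again only a sketch, with made-up names):

    def locate_median_bucket(per_machine_counts, n_total):
        """Sum the bucket counts reported by each computer and walk the
        totals to find which bucket holds the lower of the two middle
        values. Returns (bucket_index, rank_within_bucket), 0-based."""
        totals = [sum(column) for column in zip(*per_machine_counts)]
        target = n_total // 2                  # 1-based rank of the lower middle value
        seen = 0
        for b, count in enumerate(totals):
            if seen + count >= target:
                return b, target - seen - 1
            seen += count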

The total process is O(n): step 2 is a single counting pass, step 3 only walks the bucket counts, and step 4 only touches one bucket's worth of values, provided the number of buckets is large enough.