need high performance /bin/sort; any suggestions?
I'm looking for a high-performance, drop-in replacement for /bin/sort. I know there is pbzip2 for using multiple cores, but is there a similar product for /bin/sort?
I've found distsort.sh, but I want something less I/O-intensive. I'm looking to sort, oh, 60 gigs of data on a very frequent basis.
GNU sort has -m, which can probably help you out. Let us assume you have 200 .gz files that you want to sort and combine. Then you could use GNU Parallel to do:
# create one named pipe per input file
seq 1 200 | parallel mkfifo /tmp/{}
# decompress and sort each file into its own fifo, all 200 jobs in the background
ls *.gz | nice parallel -j200 'zcat {} | sort >/tmp/$PARALLEL_SEQ' &
# merge the 200 already-sorted streams into a single sorted output
seq 1 200 | parallel -X sort -m /tmp/{} >/tmp/sorted
If I/O is the problem and memory is not an issue, then use -S for the first sort to make sure everything stays in memory. Also, you may want to compress with lzop every time you write to disk (--compress-program=lzop): disks are often the limiting factor, so lzopping on the fly can give you extra speed. Or you could make a RAM disk and point -T at that directory.
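For example, something along these lines (a sketch only; the 4G buffer size, the file names, and the RAM-disk path are placeholders you would tune for your own box):
# -S sets a large in-memory buffer; --compress-program compresses anything that still spills to disk
zcat bigfile.gz | sort -S 4G --compress-program=lzop >/tmp/sorted
# or keep sort's temporary files on a RAM disk instead
mkdir -p /mnt/ramdisk
mount -t tmpfs -o size=8G tmpfs /mnt/ramdisk
zcat bigfile.gz | sort -T /mnt/ramdisk >/tmp/sorted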
In searching around, I found lots of references to academic papers and one commercial product called Nsort. I don't know anything about it other than that their web site claims that:
Nsort is a sort/merge program that can quickly sort large amounts of data, using large numbers of processors and disks in parallel. Unique in its CPU efficiency, Nsort is the only commercial sort program to demonstrate:
- 1 Terabyte sorts (33 minutes)
- 1 Gigabyte/sec file read and write rates
Nsort has a long history of sorting massive, production data sets, such as:
- Web logs for high-traffic web sites
- Phone logs
- Government agency data
Hrm. You're going to run into a few issues here, I think. First of all, your input data is going to have a big impact on sorting performance (different algorithms perform better or worse depending on the distribution of the input). However, a bigger problem up front is simply that 60GB is a lot of data.
Additionally, sorting doesn't parallelize as easily as compression, because there are no proximity guarantees. In other words, with compression/decompression, you can break the input into discrete chunks and operate on each of them separately and independently; after each chunk is processed, they're simply concatenated together. With sorting, you've got multiple steps involved, because you can't just concatenate the results (unless you do some preprocessing): you have to merge them, since an entry from the beginning of the 60GB could end up adjacent to an entry from the end of the 60GB after sorting.
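To make that two-phase shape concrete, here is a minimal sketch using split and GNU sort's -m (the chunk size and file names are arbitrary):
# phase 1: break the input into chunks and sort each chunk independently
split -l 10000000 input.txt chunk.
for f in chunk.??; do sort "$f" >"$f.sorted"; done
# phase 2: the sorted chunks cannot simply be concatenated; they must be merged
sort -m chunk.*.sorted >output.sorted
The chunk sorts in phase 1 are what you can spread across cores or machines; the final merge is mostly sequential I/O.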
I can basically think of a few general solutions here:
- Prepartition your data in a way that is friendly to sorting and recombining. For example, if you were doing a simple alphabetic sorting, you might store your data in 26 buckets, one for each letter of the alphabet. Then you could sort each bucket individually, and recombine them at the end. The specifics of how you prepartition your data would be dependent on the data itself, your current storage method, etc. Some setups might work better for this than others.
- Write your own sort front end that does basically what I wrote about above, but on the fly. In other words, you'd have a script that reads the input and, based on some very fast operation (such as reading the first letter, or whatever works for your data), distributes each piece of data to the appropriate sort bucket. Each sort operates independently until all the data has been processed, then you combine it all back together. This is actually pretty similar to a special case of using MapReduce for sorting (the first sketch after this list shows the idea).
- Use a MapReduce-based sort solution. There's an open-source project called Hadoop that provides a bunch of sub-projects, one of which is a MapReduce implementation. I've never used it, however, just read about it, and I have no idea if it would be practically applicable to your particular problem.
- Can you index the data, and then just sort the index? Is the entire 60GB part of the sort key? Or is there a smaller part that you're sorting on, and then a bunch of additional data attached to each record? If it's the latter, indexing and sorting just the key values, and then looking up the additional data as needed, might be the way to go.
- Perhaps you could completely pre-sort your data and maintain it in a sorted state. Every time you add to or update the data, you would keep it sorted as part of that operation. This solution would be highly dependent both on how you are storing your data and on whether the performance impact of the incremental sort maintenance would be acceptable.
- Lastly, you could punt on the whole thing. Dump your data into an RDBMS (I like PostgreSQL myself) and let the database handle the sorting for you (the second sketch below shows the idea).
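A rough sketch of the second option above, assuming plain text lines bucketed on a lowercase first character (the file names and the one-character key are just for illustration):
# "map" phase: distribute each line into a bucket named after its first character
awk '{ c = tolower(substr($0, 1, 1)); print > ("bucket." c) }' input.txt
# sort each bucket independently; this is the part that parallelizes
for b in bucket.?; do sort "$b" >"$b.sorted" & done
wait
# the buckets are disjoint key ranges, so the sorted buckets simply concatenate
cat bucket.?.sorted >output.sorted
Whether that final concatenation really comes out in the right order depends on your locale and your actual key; the point is only that disjoint buckets let you skip the merge step entirely.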
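And for the last option, a minimal sketch of punting to PostgreSQL (the database and table names are made up; \copy is psql's client-side bulk load):
# load the raw lines into a one-column table, then let the database do the ordering
createdb sortdb
psql sortdb -c "CREATE TABLE records (line text);"
psql sortdb -c "\copy records FROM 'input.txt'"
psql sortdb -c "\copy (SELECT line FROM records ORDER BY line) TO 'output.txt'"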
Without knowing a lot more about your data and the specifics of what you're doing, that's about the best I can offer for suggestions.
[Note: I'm not an expert on sorting, so someone smarter than me may be able to point out errors in my logic, or offer suggestions to improve on these.]