Sorting large binary files

Is there Unix utility for sorting large files containing fixed length binary records?

In other words, I'm looking for something like sort(1) but for binary files with fixed length records.

I could convert the files into text, then sort using sort(1), and then convert back into the binary representation, but I'm looking for something more time and space efficient.


Solution 1:

Turns out you're in luck; there is a GNU style unix program that does exactly this: bsort.

bsort is a hyper efficient implementation of an inplace radix sort with careful attention paid to memory access patterns when working with files larger than ram. By efficent I mean was able to best the http://sortbenchmark.org's 2014 energy efficient 10^8 record sort on hardware from mid 2014 - the record was 889 Joules, an early prototype of this was able to sort the same in 335 Joules on a stock macbook pro. For "small" data sets that fit entirely in ram (triple digit megabytes) its about 3 times faster than libc's qsort library.

Solution 2:

One solution could be to convert the input file to hex, with each record encoded on a separate line, sort that, and convert back to binary:

record_size=32
cat input \
    |xxd -cols $record_size -plain \
    |sort \
    |xxd -cols $record_size -plain -revert

However, it's slow (xxd converts about 40MB/s on my machine)

So, since I needed it, I've written binsort, which does the job:

binsort --size 32 ./input ./output

With --size 32, it assumes 32-byte fixed size records, reads ./input, writes sorted records to ./output.

Solution 3:

Unix's sort utility is OK for binary data based on byte locations within the records if you refer to them relative to the first "record". Eg -k1.28,1.32.

Unix sort is less flexible regarding it's notion of end-of-line. Depending on your data you may be able to do a much simpler stream edit than the xxd based one user68497 proposes, and use null terminated lines. This is still likely to involve a great deal of copying of data in memory though, and won't come close to the speed of an mmap based approach.

If you do use unix sort though in some manner, be careful of locale. sort assumes its input is text, and locale affects sort order.