How do we sort faster using Unix sort?
We are sorting a 5 GB file with 37 fields on 5 keys. The big file is the concatenation of 1000 files of 5 MB each.
After 190 minutes it still hasn't finished.
I am wondering whether there are other methods to speed up the sorting. We chose Unix sort because we don't want it to use up all the memory, so any purely in-memory approach is not an option.
What is the advantage of sorting each file independently and then using the -m option to merge-sort them?
Buffer it in memory using -S. For example, to use (up to) 50% of your memory as a sorting buffer, do:
sort -S 50% file
Note that modern Unix sort can sort in parallel. In my experience it automatically uses as many cores as possible, but you can set the number directly with --parallel. To sort using 4 threads:
sort --parallel=4 file
So all in all, you should put everything into one file and execute something like:
sort -S 50% --parallel=4 file
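For a fuller picture, a complete invocation might look like the sketch below. The key fields (-k), the comma separator (-t), the scratch directory /fast/tmp and the output name sorted.out are placeholders, not details from the question:
# placeholders: adjust key fields, separator, paths and names to your data
sort -t ',' -k3,3 -k7,7 -k12,12n -k20,20 -k31,31 \
     -S 50% --parallel=4 -T /fast/tmp -o sorted.out bigfile
Writing the result with -o (instead of redirecting stdout) also lets sort safely replace the input file in place if that is what you want.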
- Divide and conquer. A sort of N files can be faster if you first sort each of the N files (possibly on different CPUs of a multiprocessor machine). The files then only need to be merged (e.g. sort -m files ...; -m is POSIX and should be supported by all sorts of sorts, pun intended). Sorting each file consumes far fewer resources. See the sketch after this list.
- Give sort a fast /tmp directory (you can point it at one with -T dir or the TMPDIR environment variable).
- Thinking outside the box: make the process that creates the files sort the data right away.
- Brute force: Throw more hardware (memory, CPU cycles) at the problem :-)
- Read up on the concept of external sorting.
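A minimal sketch of the divide-and-conquer route, assuming the 1000 pieces match part-*.txt, GNU xargs is available, at most 4 sorts run at once, and a fast scratch directory exists at /fast/tmp (all of these are placeholders):
# 1. Sort each small file on its own, up to 4 in parallel (GNU xargs -P).
#    Use the same -k key options here as in the final merge.
printf '%s\n' part-*.txt | xargs -P 4 -I {} sort -S 5% -T /fast/tmp -o {}.sorted {}
# 2. Merge the already-sorted pieces; -m never re-sorts, so it is cheap.
sort -m -T /fast/tmp -o final.sorted part-*.txt.sorted
The merge step must use exactly the same key options as the per-file sorts; otherwise sort -m will not produce a totally ordered result.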