Sort a file with huge volume of data given memory constraint

Points:

  • We process thousands of flat files in a day, concurrently.
  • Memory constraint is a major issue.
  • We use thread for each file process.
  • We don't sort by columns. Each line (record) in the file is treated as one column.

Can't Do:

  • We cannot use unix/linux's sort commands.
  • We cannot use any database system no matter how light they can be.

Now, we cannot just load everything in a collection and use the sort mechanism. It will eat up all the memory and the program is gonna get a heap error.

In that situation, how would you sort the records/lines in a file?


It looks like what you are looking for is external sorting.

Basically, you sort small chunks of data first, write it back to the disk and then iterate over those to sort all.


You can read the files in smaller parts, sort these and write them to temporrary files. Then you read two of them sequentially again and merge them to a bigger temporary file and so on. If there is only one left you have your file sorted. Basically that's the Megresort algorithm performed on external files. It scales quite well with aribitrary large files but causes some extra file I/O.

Edit: If you have some knowledge about the likely variance of the lines in your files you can employ a more efficient algorithm (distribution sort). Simplified you would read the original file once and write each line to a temporary file that takes only lines with the same first char (or a certain range of first chars). Then you iterate over all the (now small) temporary files in ascending order, sort them in memory and append them directly to the output file. If a temporary file turns out to be too big for sorting in memory, you can reapeat the same process for this based on the 2nd char in the lines and so on. So if your first partitioning was good enough to produce small enough files, you will have only 100% I/O overhead regardless how large the file is, but in the worst case it can become much more than with the performance wise stable merge sort.