How do I randomly sort groups of multiple lines in a multi-GB file?

If you're using a reasonably new Linux/Unix distribution, sort comes with a -R flag which randomises lines instead of sorting them (strictly speaking, it sorts by a random hash of each line, so identical lines end up next to each other). We can use that to create this one-liner solution:

awk '{printf("%s%s",$0,(NR%4==0)?"\n":"\0")}' file.txt | sort -R | tr "\0" "\n" > sorted.txt

First, awk joins every group of 4 lines into a single record by replacing the newline at the end of lines 1-3 with a NUL byte (\0). sort -R then shuffles those records, and finally tr turns the NULs back into newlines. Since sort spills to temporary files when its input doesn't fit in memory, this works fine on multi-GB inputs.
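
If you want to sanity-check the result, here's a minimal Python sketch (assuming the file names from the pipeline above, and a line count that's a multiple of 4) that confirms the 4-line groups survived the shuffle intact:

#!/usr/bin/python3
# Quick sanity check: the shuffled output should contain exactly the
# same 4-line groups as the input, just in a different order.
# "file.txt" and "sorted.txt" are the names used in the pipeline above.

def chunks(path, size=4):
    with open(path) as fh:
        lines = fh.readlines()
    return sorted(tuple(lines[i:i+size]) for i in range(0, len(lines), size))

assert chunks("file.txt") == chunks("sorted.txt")

Note the check itself reads both files into memory, so it's for testing on samples rather than the full multi-GB file.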


This is in Python. I'm sure someone will post a Perl answer too. ;-)

#!/usr/bin/python3

import random

# Change these to the desired files
infile = "/path/to/input/file"
outfile = "/path/to/output/file"

# Read the whole file into memory as a list of lines.
with open(infile) as fh:
    contents = fh.readlines()

# Split the line list into chunks of 4 and shuffle the chunks,
# keeping the lines within each chunk in their original order.
chunked = [contents[i:i+4] for i in range(0, len(contents), 4)]
random.shuffle(chunked)

with open(outfile, 'w') as fh:
    for chunk in chunked:
        for line in chunk:
            fh.write(line)

IANA programmer, so somebody could probably improve this, but I tested it and it works just fine. Be aware that readlines() pulls the entire file into memory, which may be a problem at multi-GB sizes.
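
For the multi-GB case, one possible improvement is to shuffle byte offsets instead of line contents. This is only a sketch of that idea, with the same placeholder file names as above: a first pass records the offset and length of each 4-line chunk, then a second pass copies the chunks out in shuffled order.

#!/usr/bin/python3
# Lower-memory variant: keep only (offset, length) per 4-line chunk
# in RAM, shuffle those, then copy each chunk by seeking into the
# original file. File names are placeholders, as in the script above.

import random

infile = "/path/to/input/file"
outfile = "/path/to/output/file"

# First pass: record (offset, length) for every 4-line chunk.
chunks = []
with open(infile, "rb") as fh:
    offset = 0
    while True:
        length = 0
        for _ in range(4):
            line = fh.readline()
            if not line:
                break
            length += len(line)
        if length == 0:
            break
        chunks.append((offset, length))
        offset += length

random.shuffle(chunks)

# Second pass: seek to each chunk in shuffled order and copy it out.
with open(infile, "rb") as src, open(outfile, "wb") as dst:
    for off, length in chunks:
        src.seek(off)
        dst.write(src.read(length))

The trade-off is that the random seeks in the second pass are slower than one sequential read, but peak memory use grows with the number of chunks rather than the size of the file.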