How do I randomly sort groups of multiple lines in a multi-GB file?
If you're using a reasonably new Linux/Unix distribution, `sort` comes with a `-R` flag, which randomises lines instead of sorting them. We can use that to build this one-liner solution:
```
awk '{printf("%s%s",$0,(NR%4==0)?"\n":"\0")}' file.txt | sort -R | tr "\0" "\n" > sorted.txt
```
First, use `awk` to join every group of 4 lines into a single line by replacing `\n` with `\0` on all but every fourth line. We then shuffle those combined lines using `sort -R`, and finally restore the original line breaks with `tr`.
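Two caveats worth noting: `sort -R` orders lines by a hash of their contents, so identical groups end up next to each other rather than uniformly shuffled, and the group size of 4 is hard-coded. A hedged variant that parameterises the group size with `awk -v` and swaps in GNU `shuf` for a uniform shuffle (this assumes GNU coreutils; `n=4` is just an example value):

```
awk -v n=4 '{printf("%s%s",$0,(NR%n==0)?"\n":"\0")}' file.txt | shuf | tr "\0" "\n" > sorted.txt
```

Note that `shuf` holds its entire input in memory, while GNU `sort` spills to temporary files, so on a truly huge file `sort -R` may be the safer choice despite the hashing caveat.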
This is in Python. I'm sure someone will post a Perl answer too. ;-)
```
#!/usr/bin/env python3
import random

# Change these to the desired files
infile = "/path/to/input/file"
outfile = "/path/to/output/file"

# Read every line into memory (the whole file must fit in RAM)
with open(infile) as fh:
    contents = fh.readlines()

# Split the line list into chunks of 4 lines each
chunked = [contents[i:i + 4] for i in range(0, len(contents), 4)]

# Shuffle the chunks, keeping each 4-line group intact
random.shuffle(chunked)

with open(outfile, "w") as fh:
    for chunk in chunked:
        fh.writelines(chunk)
```
I'm not a programmer by trade, so somebody could probably improve on this, but I tested it and it works just fine.
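One caveat: `readlines()` pulls the entire file into memory, which could be a problem for the multi-GB files in the question. Here is a minimal sketch of a two-pass alternative that keeps only byte offsets in memory and shuffles those instead (the paths are placeholders as above, and the group size of 4 is assumed):

```
#!/usr/bin/env python3
import random

infile = "/path/to/input/file"    # placeholder paths, as above
outfile = "/path/to/output/file"

# Pass 1: record the byte offset and length of every 4-line group,
# tracking positions by hand since tell() is unreliable mid-iteration.
offsets = []
with open(infile, "rb") as fh:
    pos = start = 0
    count = 0
    for line in fh:
        pos += len(line)
        count += 1
        if count % 4 == 0:
            offsets.append((start, pos - start))
            start = pos
    if count % 4:                 # keep a trailing partial group, if any
        offsets.append((start, pos - start))

# Shuffle the offsets, not the data itself
random.shuffle(offsets)

# Pass 2: copy each group to the output in shuffled order
with open(infile, "rb") as src, open(outfile, "wb") as dst:
    for start, length in offsets:
        src.seek(start)
        dst.write(src.read(length))
```

The trade-off is lots of random seeks in the second pass, which is slow on spinning disks; the streaming `awk`/`sort` pipeline above avoids that.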