unix - split a huge .gz file by line

I'm sure someone has had the need below: what is a quick way of splitting a huge .gz file by line? The underlying text file has 120 million rows. I don't have enough disk space to gunzip the entire file at once, so I was wondering if someone knows of a bash/perl script or tool that could split the file (either the .gz or the inner .txt) into three 40-million-line files, i.e. calling it like:

    bash splitter.sh hugefile.txt.gz 40000000 1

would get lines 1 to 40 mn

    bash splitter.sh hugefile.txt.gz 40000000 2

would get lines 40 mn to 80 mn

    bash splitter.sh hugefile.txt.gz 40000000 3

would get lines 80 mn to 120 mn

Would doing a series of these be a solution, or would gunzip -c require enough space for the entire file to be unzipped (i.e. the original problem)?

    gunzip -c hugefile.txt.gz | head -n 40000000

Note: I can't get extra disk.

Thanks!


How to do this best depends on what you want:

  • Do you want to extract a single part of the large file?
  • Or do you want to create all the parts in one go?

If you want a single part of the file, your idea to use gunzip and head is right. You can use:

gunzip -c hugefile.txt.gz | head -n 40000000

That would output the first 40,000,000 lines on standard out; you probably want to append another pipe to actually do something with the data.
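For instance (the part file name here is made up for illustration), you could recompress the extracted part on the fly so the uncompressed text never touches the disk. A small self-contained demo:

```shell
# Small stand-in for hugefile.txt.gz so the pipeline can be tried anywhere
seq 1 100 | gzip > hugefile.txt.gz

# Extract the first 40 lines and recompress immediately;
# only the compressed part is ever written to disk
gunzip -c hugefile.txt.gz | head -n 40 | gzip > part1.txt.gz

# Verify the part really holds 40 lines
gunzip -c part1.txt.gz | wc -l
```

With the real file you would use head -n 40000000. A nice side effect: when head exits after the requested lines, the pipe closes and gunzip stops too, so no time is wasted decompressing the rest.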

To get the other parts, you'd use a combination of head and tail, like:

gunzip -c hugefile.txt.gz | head -n 80000000 | tail -n 40000000

to get the second block.

Would doing a series of these be a solution, or would the gunzip -c require enough space for the entire file to be unzipped?

No, gunzip -c does not require any disk space: it decompresses in memory, a block at a time, and streams the result to stdout. It never writes the whole uncompressed file anywhere.


If you want to create all the parts in one go, it is more efficient to create them all with a single command, because then the input file is only read once. One good solution is to use split; see jim mcnamara's answer for details.


Pipe to split, using either gunzip -c or zcat to open the file:

gunzip -c bigfile.gz | split -l 40000000

Add output specifications to the split command.
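For example, with GNU split, -d switches to numeric suffixes and a trailing argument sets the output prefix (the file names below are made up for a small demo):

```shell
# Small stand-in for bigfile.gz
seq 1 120 | gzip > bigfile.gz

# -d: numeric suffixes (GNU split); "out_" is the output name prefix,
# so the three 40-line pieces land in out_00, out_01, out_02
gunzip -c bigfile.gz | split -l 40 -d - out_

wc -l out_*
```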


As you are working on a (non-rewindable) stream, you will want to use the '+N' form of tail to get lines starting from line N onwards.

zcat hugefile.txt.gz | head -n 40000000
zcat hugefile.txt.gz | tail -n +40000001 | head -n 40000000
zcat hugefile.txt.gz | tail -n +80000001 | head -n 40000000
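Those three pipelines generalize into the splitter.sh interface the question asks for. A minimal sketch (no argument checking; the part number is 1-based as in the question):

```shell
# Write the script; its three arguments are FILE.gz, lines per part, part number
cat > splitter.sh <<'EOF'
#!/bin/sh
file=$1
lines=$2
part=$3
# First line of the requested part: parts are 1-based, each $lines long
start=$(( (part - 1) * lines + 1 ))
zcat "$file" | tail -n "+$start" | head -n "$lines"
EOF

# Demo on a small 120-line sample standing in for the real file
seq 1 120 | gzip > sample.txt.gz
sh splitter.sh sample.txt.gz 40 2   # prints lines 41..80
```

On the real file the call would be along the lines of bash splitter.sh hugefile.txt.gz 40000000 2 | gzip > part2.gz. Note that each call still decompresses the stream from the beginning, which is why producing all parts in a single split run is more efficient.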

I'd consider using split.

split a file into pieces


Directly split the .gz file into .gz files:

zcat bigfile.gz | split -l 40000000 --filter='gzip > $FILE.gz'

I think this is what the OP wanted, since they don't have much disk space.
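Note that --filter needs a reasonably recent GNU coreutils split. A quick check on a small sample (the part_ prefix is added here just for readability):

```shell
# Small stand-in for bigfile.gz
seq 1 120 | gzip > bigfile.gz

# split runs the filter command once per piece; $FILE is set by split itself,
# hence the single quotes so the shell does not expand it too early
zcat bigfile.gz | split -l 40 --filter='gzip > $FILE.gz' - part_

zcat part_ab.gz | wc -l   # each piece holds 40 lines, already gzipped
```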