unix - split a huge .gz file by line
I'm sure someone has run into this before: what is a quick way of splitting a huge .gz file by line? The underlying text file has 120 million rows. I don't have enough disk space to gunzip the entire file at once, so I was wondering if someone knows of a bash/perl script or tool that could split the file (either the .gz or the inner .txt) into three 40-million-line files, i.e. calling it like:
bash splitter.sh hugefile.txt.gz 40000000 1
would get lines 1 to 40 million
bash splitter.sh hugefile.txt.gz 40000000 2
would get lines 40 million to 80 million
bash splitter.sh hugefile.txt.gz 40000000 3
would get lines 80 million to 120 million
Is doing a series of these perhaps a solution, or would the gunzip -c require enough space for the entire file to be unzipped (i.e. the original problem)? gunzip -c hugefile.txt.gz | head -n 40000000
Note: I can't get extra disk.
Thanks!
How to do this best depends on what you want:
- Do you want to extract a single part of the large file?
- Or do you want to create all the parts in one go?
If you want a single part of the file, your idea to use gunzip and head is right. You can use:
gunzip -c hugefile.txt.gz | head -n 40000000
That would output the first 40000000 lines on standard out - you probably want to append another pipe to actually do something with the data.
To get the other parts, you'd use a combination of head and tail, like:
gunzip -c hugefile.txt.gz | head -n 80000000 | tail -n 40000000
to get the second block.
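The head/tail combination above can be wrapped into the interface the question asks for. A minimal sketch, assuming a POSIX shell (the function name splitter and the demo file names are placeholders):

```shell
# splitter FILE CHUNK PART -- print lines ((PART-1)*CHUNK + 1) .. (PART*CHUNK)
# of the decompressed FILE on stdout. Because head exits after CHUNK*PART
# lines, gunzip never decompresses more of the file than this part needs.
splitter() {
  file=$1
  chunk=$2
  part=$3
  gunzip -c "$file" | head -n $((chunk * part)) | tail -n "$chunk"
}
```

Called as splitter hugefile.txt.gz 40000000 2, it prints the second 40-million-line block without using any scratch disk space.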
"Is doing a series of these a solution, or would the gunzip -c require enough space for the entire file to be unzipped?"
No, gunzip -c does not require any disk space - it decompresses through a small in-memory buffer and streams the result to stdout.
If you want to create all the parts in one go, it is more efficient to create them all with a single command, because then the input file is only read once. One good solution is to use split; see jim mcnamara's answer for details.
Pipe to split, using either gunzip -c or zcat to open the file:
gunzip -c bigfile.gz | split -l 40000000
Add output specifications to the split command.
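For example, the output pieces can be named explicitly. A small-scale sketch, assuming GNU coreutils split (the -d flag and the file names here are illustrative; scale -l up to 40000000 for the real file):

```shell
# Demo setup: a 10-line stand-in for the real bigfile.gz.
seq 10 | gzip > bigfile.gz

# -l 4  : 4 lines per piece (use 40000000 for the real file)
# -d    : numeric suffixes instead of aa, ab, ... (GNU split)
# -     : read the data from stdin
# part_ : prefix for the output names, giving part_00, part_01, part_02
gunzip -c bigfile.gz | split -l 4 -d - part_
```

Only the compressed input and the pieces being written touch the disk; nothing is ever stored fully uncompressed in one file.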
As you are working on a (non-rewindable) stream, you will want to use the '+N' form of tail to get lines starting from line N onwards.
zcat hugefile.txt.gz | head -n 40000000
zcat hugefile.txt.gz | tail -n +40000001 | head -n 40000000
zcat hugefile.txt.gz | tail -n +80000001 | head -n 40000000
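The three commands above can be driven by a loop. A sketch with a small stand-in file and 4-line chunks (for the real file, set chunk=40000000 and drop the setup line); note each pass re-reads and re-decompresses the stream from the start, so this trades CPU for zero extra disk:

```shell
# Demo setup: a 10-line stand-in for the real hugefile.txt.gz.
seq 10 | gzip > hugefile.txt.gz

chunk=4
for part in 1 2 3; do
  start=$(( (part - 1) * chunk + 1 ))
  # tail -n +N starts output at line N; head caps the part at $chunk lines.
  zcat hugefile.txt.gz | tail -n +"$start" | head -n "$chunk" > "part$part"
done
```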
I'd consider using split:
split - split a file into pieces
Directly split a .gz file into .gz files:
zcat bigfile.gz | split -l 40000000 --filter='gzip > $FILE.gz'
I think this is what the OP wanted, because they don't have much space.
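A small-scale demo of that pipeline (the --filter option requires a reasonably recent GNU coreutils split; file names are placeholders). Each piece is recompressed on the fly, so only one piece's worth of data is uncompressed at any moment:

```shell
# Demo setup: a 10-line stand-in for the real bigfile.gz.
seq 10 | gzip > bigfile.gz

# split hands each piece to the filter command on its stdin, with $FILE
# set to the name split would have used (part_00, part_01, ...), so the
# filter writes compressed pieces part_00.gz, part_01.gz, part_02.gz.
zcat bigfile.gz | split -l 4 -d --filter='gzip > $FILE.gz' - part_
```

Note the single quotes around the filter: $FILE must be expanded by split's subshell, not by the invoking shell.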