Using awk to split text file every 10,000 lines
I have a large gzip'd text file. I'd like to do something like:
zcat BIGFILE.GZ | \
awk (snag 10,000 lines and redirect to...) | \
gzip -9 > smallerPartFile.gz
In the awk part up there, I basically want it to take 10,000 lines, send them to gzip, and repeat until all lines in the original input file are consumed. I found a script that claims to do this, but when I run it on my files and then diff the original against the split-and-remerged copy, lines are missing. So something is wrong with the awk part, and I'm not sure which part is broken.
The goal:
- Read through the source file one time for the entire operation
- Split the source into smaller parts, delimited by newline. Say, 10,000 lines per file
- Compress the part files as they are created by the split, without an extra pass after this script finishes
Here's the code. Can someone tell me why splitting with it and merging the parts back together doesn't diff cleanly against the original?
# Generate files part0.dat.gz, part1.dat.gz, etc.
# Restore with: zcat foo* | gzip -9 > restoredFoo.sql.gz (or something like that).
prefix="foo"
count=0
suffix=".sql"
lines=10000  # Split every 10000 lines.
zcat /home/foo/foo.sql.gz |
while true; do
    partname=${prefix}${count}${suffix}
    # Use awk to read the required number of lines from the input stream.
    awk -v lines=${lines} 'NR <= lines {print} NR == lines {exit}' >${partname}
    if [[ -s ${partname} ]]; then
        # Compress this part file.
        gzip -9 ${partname}
        (( ++count ))
    else
        # Last file generated is empty, delete it.
        rm -f ${partname}
        break
    fi
done
Solution 1:
I would suggest doing all the housekeeping inside awk; this works here with GNU awk:
BEGIN { file = "1" }
{ print | "gzip -9 > " file ".gz" }
NR % 10000 == 0 {
    close("gzip -9 > " file ".gz")
    file = file + 1
}
This will save 10000 lines to 1.gz, the next 10000 to 2.gz, etc. Use sprintf if you want more flexibility in filename generation.
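For instance, a sketch using sprintf for zero-padded names (the part-%04d pattern here is just an illustration, not part of the answer above):
BEGIN { n = 1; file = sprintf("part-%04d", n) }
{ print | "gzip -9 > " file ".gz" }
NR % 10000 == 0 {
    close("gzip -9 > " file ".gz")
    file = sprintf("part-%04d", ++n)
}
This writes part-0001.gz, part-0002.gz, and so on, which keeps the pieces in order when you glob them back together.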
Updated with a test
The test data used is the list of primes up to 300k, found here.
wc -lc primes; md5sum primes
Output:
25997 196958 primes
547d527ec50c2799fa6ce96dba3c26c0 primes
Now, if the awk program above is saved as split.awk and run like this (with GNU awk):
awk -f split.awk primes
Three files (1.gz, 2.gz and 3.gz) are produced. Testing these files:
for f in {1..3}; do gzip -dc $f.gz >> foo; done
Test:
diff primes foo
There should be no output if the files are identical.
And the same tests as above:
gzip -dc [1-3].gz | tee >(wc -lc) >(md5sum) > /dev/null
Output:
25997 196958
547d527ec50c2799fa6ce96dba3c26c0 -
This shows that the contents are the same and that the files are split as expected.
Solution 2:
The shorter (and more useful) answer: have you looked at the Unix split command?
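For instance, with GNU split (the --filter option needs a reasonably recent coreutils), the whole pipeline from the question becomes one command; the foo prefix is just an example:
zcat BIGFILE.GZ | split -l 10000 -d --filter='gzip -9 > $FILE.gz' - foo
That reads the stream once, writes foo00.gz, foo01.gz, and so on, and compresses each part as it is produced; zcat foo*.gz restores the original.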
Solution 3:
The short answer is that awk reads its input (the pipe from zcat, in this case) a block at a time, where a block is 512 bytes or some multiple thereof, depending on your OS. So, by the time it has the 10000th newline character (end-of-line marker) in memory, it also has the 10001st line, the 10002nd, and quite probably more (or possibly fewer) in memory too. This is a problem because those characters have already been read out of the pipe and are no longer available for the next iteration of awk to read.
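You can see the same effect without awk at all; this is just an illustration, using head as the buffering reader:
# Two consecutive readers on one pipe: the first reads ahead in blocks,
# so the second usually gets nothing.
seq 1 20 | { head -n 5 > first.txt; head -n 5 > second.txt; }
wc -l first.txt second.txt
Here second.txt typically ends up empty, because the first head already pulled the rest of the stream into its buffer before exiting.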
Solution 4:
I thought about it and found a way, though it is not efficient at all: it decompresses the entire file once for each piece, meaning that if you want to split it into 20 pieces, the big file gets decompressed 20 times. But it never stores the whole uncompressed file, only the current compressed piece, so while it's storage efficient it's CPU inefficient.
The script takes the big gzip file as its first argument and the number of lines per piece as its second.
#!/bin/bash
GZIP_FILE=$1
SPLIT_LINES=$2
TOTAL_LINES=$(zcat "$GZIP_FILE" | wc -l)
START=1
NEXT_START=0
while [ "$NEXT_START" -lt "$TOTAL_LINES" ]; do
    NEXT_START=$(( NEXT_START + SPLIT_LINES ))
    echo .
    # Decompress the whole file again for every piece and keep only this line range.
    zcat "$GZIP_FILE" | sed -n "${START},${NEXT_START}p" | gzip -9 > "${GZIP_FILE}.lines-${START}-${NEXT_START}.gz"
    START=$(( NEXT_START + 1 ))
done
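For example, assuming the script is saved as split_gzip.sh (the name is mine):
chmod +x split_gzip.sh
./split_gzip.sh /home/foo/foo.sql.gz 10000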
This creates, in the same directory, one file per piece, named after the gzip file with ".lines-$startline-$endline.gz" appended.
Hope you are ok wasting CPU :)