Split a large .gz file and gzip each split file?

Solution 1:

Here's a loop around awk and gzip that will split a file on line boundaries and compress the parts as it goes:

# Generate files part0.dat.gz, part1.dat.gz, etc.
prefix="part"
count=0
suffix=".dat"

lines=10000 # Split every 10000 lines.

zcat thefile.dat.gz |
while true; do
  partname=${prefix}${count}${suffix}

  # Use awk to read the required number of lines from the input stream.
  awk -v lines=${lines} 'NR <= lines {print} NR == lines {exit}' >${partname}

  if [[ -s ${partname} ]]; then
    # Compress this part file.
    gzip --best ${partname}
    (( ++count ))
  else
    # Last file generated is empty, delete it.
    rm -f ${partname}
    break
  fi
done
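
If GNU coreutils is available, the same job can be done without the manual loop. The following is a minimal sketch, assuming GNU split 8.13 or newer (for the --filter option); split sets $FILE to each chunk's output name, so the filter compresses every chunk as it is written:

# Sketch: requires GNU split with --filter support (coreutils >= 8.13).
# Generates part00.dat.gz, part01.dat.gz, etc.
zcat thefile.dat.gz |
  split -d -l 10000 --filter='gzip --best > "$FILE".dat.gz' - part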

To recreate the original file, just run zcat part*.dat.gz | gzip --best >thefile1.dat.gz. The recompressed file might have a different MD5 checksum from the original, because the gzip compression options may differ from those used to create it, but the uncompressed contents will be absolutely identical.
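
To check the round trip, compare checksums of the uncompressed streams rather than the .gz containers. A minimal sketch, assuming GNU coreutils md5sum is installed (substitute shasum or similar otherwise):

# Both commands should print the same hash if nothing was lost.
zcat thefile.dat.gz | md5sum
zcat part*.dat.gz | md5sum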