How to tar.gz many similar-size files into multiple archives with a size limit
Solution 1:
This is a quick, rough sketch, but tested on a directory with 3000 files, the script below did the job extremely fast:
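#!/usr/bin/env python3
import subprocess
import os
import sys

# number of archives to split into (one extra is created for any left-overs)
splitinto = 2

dr = sys.argv[1]
os.chdir(dr)
# list the (new) working directory, so a relative path argument works as well
files = os.listdir(".")
n_files = len(files)
size = n_files // splitinto

def compress(tar, files):
    # --null precedes -T so that tar reads the name list on stdin as NUL-separated
    command = ["tar", "-zcvf", "tarfile" + str(tar) + ".tar.gz", "--null", "-T", "-"]
    proc = subprocess.Popen(command, stdin=subprocess.PIPE)
    with proc:
        proc.stdin.write(b'\0'.join(map(str.encode, files)))
        proc.stdin.write(b'\0')
    # leaving the with block closes stdin and waits for tar, so returncode is set here
    if proc.returncode:
        sys.exit(proc.returncode)

sub = []; tar = 1
for f in files:
    sub.append(f)
    if len(sub) == size:
        compress(tar, sub)
        sub = []; tar += 1
if sub:
    # taking care of left
    compress(tar, sub)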
How to use
- Save it into an empty file as compress_split.py
- In the head section, set the number of archives to split the files into (splitinto). In practice, there will usually be one more archive to take care of the remaining few "left overs".
- Run it with the directory containing your files as argument:
python3 /path/to/compress_split.py /directory/with/files/tocompress
Numbered .tar.gz files will be created in the same directory as the files.
Explanation
The script:
- lists all files in the directory
- cd's into the directory to prevent adding the path info to the tar file
- reads through the file list, grouping them by the set division
- compresses the sub group(s) into numbered files
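The grouping step on its own can be sketched as follows. This is only an illustration of the logic described above, not part of the script: the function name group_by_count and the file names f0 to f6 are made up for the example, and nothing is compressed.

```python
def group_by_count(files, splitinto):
    """Group names the way the script does: fixed-size groups, plus left-overs."""
    size = len(files) // splitinto
    groups, sub = [], []
    for f in files:
        sub.append(f)
        if len(sub) == size:
            groups.append(sub)
            sub = []
    if sub:
        # the remaining few files end up in one extra group
        groups.append(sub)
    return groups

# hypothetical file names, only for illustration
names = ["f%d" % i for i in range(7)]
print(group_by_count(names, 2))
# [['f0', 'f1', 'f2'], ['f3', 'f4', 'f5'], ['f6']]
```

With 7 files and splitinto = 2, the group size is 3, so two full groups are written and the seventh file lands in a third, "left over" archive.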
EDIT
Automatically create chunks by size in MB
More sophisticated is to pass the maximum size (in MB) of the chunks as a (second) argument. In the script below, a chunk is written into a compressed file as soon as it reaches (passes) the threshold.
Since writing is triggered by a chunk exceeding the threshold, this will only work well if the size of (all) individual files is substantially smaller than the chunk size.
The script:
#!/usr/bin/env python3
import subprocess
import os
import sys

dr = sys.argv[1]
# threshold in MB of (uncompressed) input per archive
chunksize = float(sys.argv[2])
os.chdir(dr)
# list the (new) working directory, so a relative path argument works as well
files = os.listdir(".")

def compress(tar, files):
    # --null precedes -T so that tar reads the name list on stdin as NUL-separated
    command = ["tar", "-zcvf", "tarfile" + str(tar) + ".tar.gz", "--null", "-T", "-"]
    proc = subprocess.Popen(command, stdin=subprocess.PIPE)
    with proc:
        proc.stdin.write(b'\0'.join(map(str.encode, files)))
        proc.stdin.write(b'\0')
    # leaving the with block closes stdin and waits for tar, so returncode is set here
    if proc.returncode:
        sys.exit(proc.returncode)

sub = []; tar = 1; subsize = 0
for f in files:
    sub.append(f)
    # running total of the current chunk, in MB
    subsize = subsize + (os.path.getsize(f)/1000000)
    if subsize >= chunksize:
        compress(tar, sub)
        sub = []; tar += 1; subsize = 0
if sub:
    # taking care of left
    compress(tar, sub)
To run:
python3 /path/to/compress_split.py /directory/with/files/tocompress chunksize
...where chunksize is the size (in MB) of the uncompressed input for each tar command, not the size of the resulting archive.
In this one, the suggested improvements by @DavidFoerster are included. Thanks a lot!
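To see why the file sizes need to be well below the chunk size, the size-based grouping can be tried in isolation. A minimal sketch, assuming the sizes are already known: the function name group_by_size and the names and sizes in the example are hypothetical, and no files are read or compressed.

```python
def group_by_size(sized_files, chunksize):
    """Close a group as soon as its running total reaches or passes chunksize.

    sized_files is a list of (name, size_in_mb) pairs.
    """
    groups, sub, subsize = [], [], 0
    for name, size in sized_files:
        sub.append(name)
        subsize += size
        if subsize >= chunksize:
            groups.append(sub)
            sub, subsize = [], 0
    if sub:
        # left-overs that never reached the threshold
        groups.append(sub)
    return groups

# hypothetical names and sizes (in MB), only for illustration
example = [("a", 4), ("b", 4), ("c", 4), ("d", 4), ("e", 4)]
print(group_by_size(example, 10))
# [['a', 'b', 'c'], ['d', 'e']]
```

The first group holds 12 MB although the threshold is 10: a chunk can overshoot by up to the size of its last file, which is negligible only when the files are small compared to the chunk size.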
Solution 2:
A pure shell approach:
files=(*);
num=$((${#files[@]}/8));
k=1
for ((i=0; i<${#files[@]}; i+=$num)); do
    tar cvzf files$k.tgz -- "${files[@]:$i:$num}"
    ((k++))
done
Explanation
- files=(*) : save the list of files (also directories if any are present; change to files=(*.txt) to get only things with a txt extension) in the array $files.
- num=$((${#files[@]}/8)); : ${#files[@]} is the number of elements in the array $files. The $(( )) is bash's (limited) way of doing arithmetic. So, this command sets $num to the number of files divided by 8 (integer division, rounded down).
- k=1 : just a counter to name the tarballs.
- for ((i=0; i<${#files[@]}; i+=$num)); do : step through the indices of the array. $i is initialized at 0 (the first element of the array) and incremented by $num. This continues until we've gone through all elements (files).
- tar cvzf files$k.tgz -- "${files[@]:$i:$num}" : in bash, you can get an array slice (part of an array) using ${array[@]:start:length}, so ${array[@]:2:3} will return three elements starting from index 2 (the third element, since indices start at 0). Here, we are taking a slice that starts at the current value of $i and is $num elements long. The -- is needed in case any of your file names start with a -.
- ((k++)) : increment $k
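The same slicing arithmetic, written out in Python purely to show how many tarballs to expect. This is a sketch and not a replacement for the shell loop: the function name slice_groups and the file names are made up for the example. The max(1, ...) guard is an addition; in the bash loop, a directory with fewer than 8 entries gives $num = 0, so $i never advances.

```python
def slice_groups(files, parts=8):
    """Split files into consecutive slices of len(files) // parts elements."""
    # guard (not in the shell version): never use a step of 0
    num = max(1, len(files) // parts)
    return [files[i:i + num] for i in range(0, len(files), num)]

# hypothetical file names, only for illustration
names = ["file%02d" % n for n in range(17)]
groups = slice_groups(names)
print(len(groups), [len(g) for g in groups])
# 9 [2, 2, 2, 2, 2, 2, 2, 2, 1]
```

With 17 files, $num is 2, so the loop produces 9 tarballs rather than 8: eight with two files each and a ninth with the remaining one.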