Why are tar.xz files 15x smaller when using Python's tar library compared to macOS tar?
Context
I'm compressing ~1.3 GB folders, each filled with 1440 JSON files, and find that there's a 15-fold difference in archive size between using the tar command and Python's built-in tarfile library on macOS or Raspbian 10 (Buster).
Minimal working example
This script compares both methods:
#!/usr/bin/env python3
from pathlib import Path
from subprocess import call
import tarfile
fullpath = Path("/Users/user/Desktop/temp/tar/2021-03-11")
zsh_out = Path(fullpath.parent, "zsh-archive.tar.xz")
py_out = Path(fullpath.parent, "py-archive.tar.xz")
# tar using terminal
# tar cJf zsh-archive.tar.xz folderpath
call(["tar", "cJf", zsh_out, fullpath])
# tar using tarfile library
with tarfile.open(py_out, "w:xz") as tar:
    tar.add(fullpath, arcname=fullpath.stem)
# Print filesizes
print(f"zsh tar filesize: {round(Path(zsh_out).stat().st_size/(1024*1024), 2)} MB")
print(f"py tar filesize: {round(Path(py_out).stat().st_size/(1024*1024), 2)} MB")
The output is:
zsh tar filesize: 23.7 MB
py tar filesize: 1.49 MB
The versions I use are as follows:
- tar on macOS: bsdtar 3.3.2 - libarchive 3.3.2 zlib/1.2.11 liblzma/5.0.5 bz2lib/1.0.6
- tar on Raspbian 10: xz (XZ Utils) 5.2.4 liblzma 5.2.4
- tarfile Python library: 0.9.0
Things I've tried
After compression, I've extracted both archives and compared the resulting folder with:
diff -r py-archive-expanded zsh-archive-expanded
There was no difference.
If I compare the two tar archives directly, they seem different:
➜ diff zsh-archive.tar.xz py-archive.tar.xz
Binary files zsh-archive.tar.xz and py-archive.tar.xz differ
If I inspect the archives with Quicklook (and the Betterzip plugin) I see that the files in the archive are ordered in a different way:
Left is zsh-archive.tar.xz, right is py-archive.tar.xz.
The zsh archive uses an unknown order, and the Python archive orders the files by modification date. I am not sure if that matters.
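The stored order can also be checked without Quicklook. Here is a minimal sketch, assuming only the standard tarfile module; the helper names member_order and build_archive and the two file names are made up for illustration, and the demo archive is built in memory so the snippet runs on its own:

```python
import io
import tarfile


def member_order(archive_bytes):
    """Return member names in the order they are stored in the archive."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:*") as tar:
        return tar.getnames()


def build_archive(names):
    """Build a small in-memory .tar.xz holding one tiny file per name (demo only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:xz") as tar:
        for name in names:
            payload = b"{}"
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


# Hypothetical file names, deliberately added in non-sorted order.
demo = build_archive(["1615420860.json", "1615420800.json"])
print(member_order(demo))
```

For the real archives, tarfile.open("zsh-archive.tar.xz").getnames() should give the same kind of listing, so the two orders can be compared directly.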
Question
What is going on? Am I losing something by using the Python library to compress my data? Is the 15-fold difference in size an indicator of some issue? Or can I safely go ahead and use the efficient Python implementation?
Short answer: yes, it is safe to use Python's tarfile to compress the data; nothing is lost compared to BSD tar.
Underlying issue: sorting
I think the underlying issue is that BSD tar and GNU tar, without any sort options, put the files in the archive in an undefined order.
GNU tar has a --sort option:
sort directory entries according to ORDER, which is one of none, name, or inode.
The default is --sort=none, which stores archive members in the same order as returned by the operating system.
Testing GNU tar
To test this I installed GNU tar on my Mac with:
brew install gnu-tar
And then tarred the same folder, but with the --sort option:
gtar --sort='name' -cJf zsh-archive-sorted.tar.xz /Users/user/Desktop/temp/tar/2021-03-11
The zsh-archive-sorted.tar.xz archive is 1.5 MB, equal to the size of the archive created by the Python library.
Concatenating in sorted order
The effect sorting has on the final archive size is further demonstrated by first concatenating all the JSON files sorted by name (each name starts with the creation unixtime, and the shell expands *.json in name order) and then tarring with BSD tar:
cat *.json > all.txt
tar cJf zsh-cat-archive.tar.xz all.txt
The zsh-cat-archive.tar.xz archive is also 1.5 MB.
Python tarfile sorting
Finally, the documentation of the Python TarFile.add function confirms that Python tarfile sorts by default:
Directories are added recursively by default. This can be avoided by setting recursive to False. Recursion adds entries in sorted order.
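If you would rather not depend on that default (for example to sort by a different key), the files can be added one by one in an explicit order. A minimal sketch, assuming a flat folder of files; the function name make_sorted_archive and the three file names are hypothetical, and the demo writes to a temporary folder and an in-memory buffer:

```python
import io
import tarfile
import tempfile
from pathlib import Path


def make_sorted_archive(folder, out_fileobj):
    """Write the regular files in `folder` to a .tar.xz, sorted by file name."""
    with tarfile.open(fileobj=out_fileobj, mode="w:xz") as tar:
        for path in sorted(folder.iterdir(), key=lambda p: p.name):
            if path.is_file():
                tar.add(path, arcname=f"{folder.name}/{path.name}")


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "2021-03-11"
    folder.mkdir()
    # Create the files in a deliberately non-sorted order.
    for stamp in ("1615420920", "1615420800", "1615420860"):
        (folder / f"{stamp}.json").write_text("{}")
    buf = io.BytesIO()
    make_sorted_archive(folder, buf)

# Read the archive back and list the members in stored order.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r:xz") as tar:
    stored_names = tar.getnames()
print(stored_names)
```

Changing the key passed to sorted() is enough to try other orderings, such as modification time.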
Why sorting matters
I think the reason sorting has such an impact in my case is as follows:
My JSON files contain locations of hundreds of vehicles. Every minute I read out all the locations, but only a few of these locations have a different value from minute to minute.
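By sorting the files by name, two consecutive files in the archive differ in only a few characters.
This is very favourable for compression: xz (LZMA) encodes repeated data as back-references to earlier occurrences within a limited dictionary window, so near-identical files compress best when they sit next to each other.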
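The effect can be reproduced on synthetic data. The sketch below is an illustration only, not a measurement of the real files: it invents 100 vehicles and 200 one-minute snapshots in which a few vehicles move per minute, then compresses the same snapshots with Python's lzma module in chronological and in shuffled order. All names and numbers are assumptions made for the demo.

```python
import json
import lzma
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical fleet: 100 vehicles with a [lat, lon] position each.
positions = [[rng.uniform(52.0, 53.0), rng.uniform(4.0, 5.0)] for _ in range(100)]

snapshots = []
for minute in range(200):
    # Only a few vehicles move from one minute to the next.
    for vehicle in rng.sample(range(100), 3):
        positions[vehicle][0] += rng.uniform(-0.001, 0.001)
        positions[vehicle][1] += rng.uniform(-0.001, 0.001)
    records = [
        {"id": i, "lat": round(lat, 5), "lon": round(lon, 5)}
        for i, (lat, lon) in enumerate(positions)
    ]
    snapshots.append(json.dumps(records).encode())

# Same snapshots, two different orders.
chronological = b"".join(snapshots)
shuffled_snapshots = snapshots[:]
rng.shuffle(shuffled_snapshots)
shuffled = b"".join(shuffled_snapshots)

size_sorted = len(lzma.compress(chronological))
size_shuffled = len(lzma.compress(shuffled))
print(f"chronological: {size_sorted} bytes, shuffled: {size_shuffled} bytes")
```

The gap on this toy data need not match the 15-fold difference above; that depends on file sizes, how much changes per minute, and the dictionary size of the xz preset in use.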
Try setting the compression levels in the macOS command line.
I know you are asking about xz, but as explained in this answer here, on older versions of GZip you can set the compression level with an environment variable like this:
GZIP=-9 tar czf zsh-archive.tar.xz folderpath
That said, that only seems to work with GZip 1.8 and is deprecated in later versions. So use the -I/--use-compress-program=COMMAND option for tar instead; note that this option might not work on macOS, but I'm including it here just in case. The command would then change to:
tar -I 'gzip -9' -cf zsh-archive.tar.xz folderpath
And yes, these examples compress the archive with Gzip instead of xz, but you can easily change the command to use xz like this:
tar -I 'xz -9' -cf zsh-archive.tar.xz folderpath
The xz compression level ranges from -0 to -9 with the default being -6; so -9 is the highest compression level.
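On the Python side, tarfile.open accepts a preset keyword for the "w:xz" mode, which plays the same role as the xz level. A minimal sketch, with a made-up helper xz_tar_bytes and a made-up repetitive payload, built in memory so it runs on its own:

```python
import io
import tarfile


def xz_tar_bytes(payload, preset):
    """Return a .tar.xz (as bytes) holding one file, compressed at `preset`."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:xz", preset=preset) as tar:
        info = tarfile.TarInfo(name="data.json")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


# Hypothetical, highly repetitive payload, just to have something to compress.
payload = b'{"vehicle": 1, "lat": 52.1, "lon": 4.3}\n' * 5000
sizes = {preset: len(xz_tar_bytes(payload, preset)) for preset in (0, 6, 9)}
print(sizes)
```

When no preset is given, Python's lzma module uses preset 6, the same default as the xz command, so the level on its own probably does not account for the gap in the question.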
Just note that xz is not installed on macOS by default. To install it on macOS you must first install Homebrew and then install xz via Homebrew like this:
brew install xz
Makes me wonder what Python is using for compression
http://tukaani.org/xz/
It's probably using the function calls in liblzma. Tar is probably piping through the xz shell command.
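For what it's worth, CPython's lzma module is a binding to liblzma, and tarfile's "w:xz" mode goes through that module. The bsdtar version string quoted in the question also lists liblzma/5.0.5, which suggests the macOS tar links the same library rather than calling the xz command. A small sketch to check that tarfile's output is an ordinary .xz stream wrapping a plain tar; the member name is made up:

```python
import io
import lzma
import tarfile

XZ_MAGIC = b"\xfd7zXZ\x00"  # the six magic bytes at the start of every .xz file

# Write a tiny archive with tarfile's xz mode, entirely in memory.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:xz") as tar:
    payload = b'{"id": 1}'
    info = tarfile.TarInfo(name="example.json")  # hypothetical member name
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))
raw = buf.getvalue()

is_xz_stream = raw.startswith(XZ_MAGIC)

# Decompress with the lzma module alone, then read the result as plain tar.
plain_tar = lzma.decompress(raw)
with tarfile.open(fileobj=io.BytesIO(plain_tar), mode="r:") as tar:
    names = tar.getnames()

print(is_xz_stream, names)
```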
A quick comment on --sort=name:
The sort option is a relatively recent enhancement to GNU tar and was introduced in tar version 1.28.
It may never be implemented in BSD tar.