How can I get the uncompressed size of gzip file without actually decompressing it?
Please find my OS details:
$ uname -a
AIX xxyy 1 6 000145364C00
I've tried the following command to get the size of the file inside the gzip archive:
$ gzip -l mycontent.DAT.Gz
compressed uncompr. ratio uncompressed_name
-1223644243 1751372002 -75.3% mycontent.DAT.Gz
I'm not sure how to interpret the uncompressed size from this. The compressed file is close to 4 GB.
So I tried this instead to get the correct figure:
$ zcat mycontent.DAT.Gz | wc -c
It gives me this error:
mycontent.DAT.Gz.Z:A file or directory in the path name does not exist.
0
Can you please tell me how to capture this value from a shell script without decompressing the source file?
To answer the question in the title:
How can I get the uncompressed size of a gzip file without actually decompressing it?
As you obviously know, the option -l (--list) usually shows the uncompressed size.
What it shows is not calculated from the data; it is read from a size field stored in the compressed file itself (in the trailer, after the compressed data).
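Per RFC 1952, that stored value (ISIZE) occupies the last four bytes of the file in little-endian order, so it can be read without running gzip at all. The following is only a sketch: gzip_isize is a made-up helper name, and it assumes a shell with 64-bit arithmetic and a file that contains a single gzip member. The bytes are read one at a time so the result does not depend on the byte order of the host.

```shell
# gzip_isize is a hypothetical helper, not part of gzip.
# Prints the ISIZE field: the uncompressed size modulo 2^32.
gzip_isize() {
    # Dump the last four bytes of the file as unsigned decimal values.
    bytes=$(tail -c 4 "$1" | od -An -tu1)
    # Split the four values into $1..$4 (word splitting is intended here).
    set -- $bytes
    # ISIZE is little-endian: the first byte is the least significant.
    echo $(( $1 + $2 * 256 + $3 * 65536 + $4 * 16777216 ))
}
```

For example, gzip_isize mycontent.DAT.Gz prints the stored value as an unsigned number. It is still only the size modulo 2^32, so it is not the true size for data of 4 GiB or more.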
In your case, the -l option does not work for some reason.
But it's not possible to 'measure' the uncompressed size from the raw compressed data without decompressing it. The compressed stream carries no other information about the original length, which is not surprising, as the point of compression is to leave out anything not needed.
You do not need to store the uncompressed data on disk: zcat file.gz | wc -c is the right approach, but as @OleTange answered, your zcat does not seem to be the one from gzip.
The alternative is to use the gzip options -d (--decompress) and -c (--to-stdout), combined with the wc option -c (--bytes):
gzip -dc file.gz | wc -c
Your zcat is not GNU zcat but the one from compress, which is why it appends .Z and looks for mycontent.DAT.Gz.Z. Try:
gzcat mycontent.DAT.Gz | LC_ALL=C wc -c
gzip -dc mycontent.DAT.Gz | LC_ALL=C wc -c
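Since the question asks for the value inside a shell script, the pipeline above can be wrapped in a small function and captured with command substitution. This is a sketch only: gz_uncompressed_size is a hypothetical name, and it assumes that gzip is on the PATH.

```shell
# gz_uncompressed_size is a hypothetical helper name.
# Streams the decompressed data through wc; nothing is written to disk.
gz_uncompressed_size() {
    # LC_ALL=C makes wc count plain bytes; tr removes the leading
    # spaces that some wc implementations print before the number.
    gzip -dc "$1" | LC_ALL=C wc -c | tr -d ' '
}
```

In a script this could be used as size=$(gz_uncompressed_size mycontent.DAT.Gz). The whole file is still decompressed in memory, so the run time grows with the size of the data.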
I like using pv, as it shows the size in a more human-readable form along with progress:
zcat file.gz | pv > /dev/null
Outputs:
7,65GiB 0:00:44 [ 174MiB/s] [
Unfortunately, the only way to know is to decompress it and count the bytes. gzip files do not properly report the uncompressed size when it is 4 GiB (2^32 bytes) or larger. See RFC 1952, which defines the gzip file format:
ISIZE (Input SIZE)
This contains the size of the original (uncompressed) input
data modulo 2^32.
This discrepancy might be a little more obvious if the version of gzip you are using didn't have a bug: it treats the ISIZE value as a signed 32-bit integer (resulting in -1223644243) rather than an unsigned 32-bit integer (which would give 3071323053).
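The signed-to-unsigned correction can be checked with shell arithmetic. This is a minimal sketch, assuming a shell whose arithmetic is at least 64 bits wide (bash and most current sh implementations; an old ksh may not qualify). to_unsigned32 is a made-up helper name.

```shell
# to_unsigned32 is a hypothetical helper name.
# Reinterprets a signed 32-bit value as unsigned by keeping the low 32 bits.
to_unsigned32() {
    echo $(( $1 & 0xFFFFFFFF ))
}

to_unsigned32 -1223644243   # prints 3071323053
```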
The most you can determine from the stored ISIZE value alone is that the real size of the uncompressed data is
(n * 4,294,967,296) + 3,071,323,053
where n is some non-negative whole number.
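To make the formula concrete, the first few candidate sizes can be listed as below. This is a sketch with a hypothetical helper name (candidate_sizes), again assuming 64-bit shell arithmetic; the cut-off at n = 3 is arbitrary.

```shell
# candidate_sizes is a hypothetical helper name.
# Prints ISIZE + n * 2^32 for n = 0 .. max_n, one value per line.
candidate_sizes() {
    isize=$1
    max_n=$2
    n=0
    while [ "$n" -le "$max_n" ]; do
        echo $(( n * 4294967296 + isize ))
        n=$(( n + 1 ))
    done
}

candidate_sizes 3071323053 3
# prints:
# 3071323053
# 7366290349
# 11661257645
# 15956224941
```

Which candidate is the real size cannot be decided from the stored field. Counting the bytes of the decompressed stream is the only way to settle it.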