Does gunzip work in memory or does it write to disk?
It's always going to be quicker to cat the uncompressed file as there's no overhead associated with that. Even if you're not writing a temporary file, you're going through the decompression motions, which munch CPU. If you're accessing these files often enough, it's probably better to keep them uncompressed if you have the space.
That said, dumping data to standard out (gunzip -c, zcat, etc.) won't trigger writing to a temporary file. The data is piped directly to the grep command, which treats the uncompressed stream as its own standard in.
The Wikipedia article on LZ* encoding is here: http://en.wikipedia.org/wiki/LZ77_and_LZ78.
As always, nothing beats actual measurement.
Your mileage may vary, but on my system, grepping an already uncompressed file took about a third of the time that piping zcat or gunzip into grep did. This isn't surprising.
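If you want to repeat that measurement yourself, here is a rough sketch. The sample file, the pattern and the now_ns helper are all made up for illustration, and it assumes GNU date (for %N) and seq are available. A single run on a small file that is already in the page cache mostly measures CPU cost, so treat the numbers as a starting point only.

```shell
# Hypothetical benchmark: file names, sizes and the pattern are illustrative only.
workdir=$(mktemp -d)

# Build a sample text file and a gzipped copy of it.
seq 1 200000 | sed 's/$/ some log line foo bar/' > "$workdir/sample.txt"
gzip -c "$workdir/sample.txt" > "$workdir/sample.txt.gz"

# Current time in nanoseconds (assumes GNU date, which supports %N).
now_ns() { date +%s%N; }

# Case 1: grep the uncompressed file directly.
start=$(now_ns)
plain_count=$(grep -c 'foo' "$workdir/sample.txt")
plain_ns=$(( $(now_ns) - start ))

# Case 2: decompress to stdout and grep the stream.
start=$(now_ns)
piped_count=$(gzip -dc "$workdir/sample.txt.gz" | grep -c 'foo')
piped_ns=$(( $(now_ns) - start ))

echo "plain grep:      $plain_count matches in $((plain_ns / 1000000)) ms"
echo "gzip -dc | grep: $piped_count matches in $((piped_ns / 1000000)) ms"

rm -rf "$workdir"
```

Run it several times, and on files that resemble your real data, before drawing conclusions.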
Using compression could actually deliver faster throughput to disks, but that depends on a number of factors, including the compression algorithm used and the kind of data you're moving around. ZFS, for example, heavily relies on this assumption.
gzip will either decompress the whole file to a temporary one and rename it at the end (standard gzip -d myfile.gz) or not use any temporary file at all, reading some blocks of compressed data at a time and writing the uncompressed data to stdout (gzip -d -c...).
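You can see the difference between the two modes from the outside with something like the following. It is only a sketch: the scratch directory and the file contents are invented for the demonstration, and it just lists the directory after each step.

```shell
# Illustrative only: works in a throwaway directory created by mktemp.
workdir=$(mktemp -d)
printf 'alpha\nfoo\nbeta\n' > "$workdir/myfile"
gzip "$workdir/myfile"                 # the directory now holds only myfile.gz

# With -c the data goes to stdout: the archive is untouched, no new file appears.
streamed=$(gzip -dc "$workdir/myfile.gz" | grep -c foo)
after_stream=$(ls "$workdir")

# Without -c the uncompressed file ends up on disk and the .gz is removed.
gzip -d "$workdir/myfile.gz"
after_decompress=$(ls "$workdir")

echo "matches from the stream: $streamed"
echo "after gzip -dc: $after_stream"
echo "after gzip -d:  $after_decompress"

rm -rf "$workdir"
```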
On a modern system I suspect gunzip | grep could be faster than grepping an uncompressed file. In any case, gunzip | grep will always win over decompressing a file and then grepping the uncompressed one :)
You can also substitute lzo for gzip to improve performance.
Using LZO can make things faster (less disk input/output and little compression CPU overhead).
gzip -dc | grep foo (or gunzip -c | grep foo) writes to a pipe. How the pipe is implemented depends on your operating system, but generally it will stay in memory. As others have pointed out, grepping an uncompressed file is always going to be faster due to the time it takes to decompress the compressed data. Using a different compression program may or may not improve performance; you can always measure it.
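If you want to convince yourself that the right-hand side of the pipeline really reads from a pipe and not from a file, a quick check along these lines should do. It is a sketch: stdin_kind is a made-up helper, and it assumes your system exposes /dev/stdin (Linux and most Unix-likes do).

```shell
# Illustrative only: stdin_kind is a hypothetical helper, not a standard tool.
workdir=$(mktemp -d)
printf 'foo\nbar\n' | gzip -c > "$workdir/data.gz"

# Report whether standard input is a pipe (FIFO) or something else.
stdin_kind() { if [ -p /dev/stdin ]; then echo pipe; else echo not-a-pipe; fi; }

# Standing in for grep at the end of the pipeline: stdin is a pipe.
pipe_check=$(gzip -dc "$workdir/data.gz" | stdin_kind)

# For contrast, stdin redirected from a regular file is not a pipe.
file_check=$(stdin_kind < "$workdir/data.gz")

echo "gzip -dc | ...: $pipe_check"
echo "... < file:     $file_check"

rm -rf "$workdir"
```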