Does tar | gpg | bzip2 use more memory than doing each step individually?
I have a bunch of files I routinely need to tar, encrypt with gpg, and then compress. This is on a Linux VPS server, so memory is more of a consideration than speed of execution (which I really don't care about).
If I do the three steps as one command (tar | gpg | bzip2 > output.tar.gpg.bzip2) is that going to consume more memory than if I first call tar, then call gpg, then call bzip2?
The files are potentially quite large (hundreds of megabytes to gigabytes).
The more programs running simultaneously, the more memory it will require. But it is a trade-off: using pipes instead of running each program separately tends to require less disk space, since you only need storage for the initial input and the final output. If you pipe through ssh or something similar to store the data elsewhere, you may not need any local disk space beyond the input files. Another property of pipes is that no stage processes data faster than the slowest stage in the pipe.
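For instance, the ssh case above, as a minimal sketch (the recipient and host are placeholders):

    # archive, encrypt, and ship to a remote box in one pass;
    # only the source files ever touch the local disk
    tar cf - myfiles/ | gpg --encrypt --recipient you@example.com \
        | ssh user@backuphost 'cat > backup.tar.gpg'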
The OpenPGP format that GnuPG uses supports compression natively. Compression happens before encryption, which is both more secure and far more effective than compressing after encryption. It is also auto-detected on decryption, so you won't have to add a decompression step to the pipeline, and it may use less memory than running a separate compression program. GnuPG supports ZIP, ZLIB (gzip), and BZIP2 compression.
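A sketch of what that looks like, assuming a recipient key (placeholder below); note there is no separate bzip2 stage, because gpg compresses internally before encrypting:

    # let gpg handle the compression (bzip2) itself
    tar cf - myfiles/ | gpg --compress-algo bzip2 --encrypt \
        --recipient you@example.com > output.tar.gpg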
When you do that (tar | gpg | bzip2 > output.tar.gpg.bzip2) you are piping commands: all of them run at the same time, and the output of each command is redirected into the input of the next one (the last output is redirected into the file). So tar's output goes to gpg's input, gpg's output goes to bzip2's input, and bzip2 finally writes to the file.

So, when you pipe, you use more memory, because you are running all the commands at the same time. You also use more processing power at once, since gpg and bzip2 are both CPU-hungry programs.
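Concretely, the two approaches look like this (the recipient is a placeholder):

    # pipeline: all three processes run at the same time
    tar cf - myfiles/ | gpg -e -r you@example.com | bzip2 > output.tar.gpg.bz2

    # sequential: one process at a time, with intermediate files on disk
    tar cf myfiles.tar myfiles/
    gpg -e -r you@example.com -o myfiles.tar.gpg myfiles.tar
    bzip2 myfiles.tar.gpg        # writes myfiles.tar.gpg.bz2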
You should check that bzip2 is actually doing something worthwhile to the size: gpg output should look like a (completely) random stream, and random data is essentially incompressible.
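A quick way to check, as a sketch (the recipient is a placeholder; gpg output sizes vary slightly between runs, so only a large difference matters):

    # byte count without and with the bzip2 stage
    tar cf - myfiles/ | gpg -e -r you@example.com | wc -c
    tar cf - myfiles/ | gpg -e -r you@example.com | bzip2 | wc -c

If the two counts are close, the bzip2 stage is wasted CPU.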
All of these commands (tar, bzip2 and gpg) work in a stream-like way: each takes a chunk of data (say, one megabyte), transforms it, and pushes it to the next stage. After that, the RAM is reused for another chunk of data. So even if you have a terabyte of data to process, it will be processed piece by piece.
Now, when you call tar by itself, tar works this way: read one chunk of data from disk, wrap it with the headers needed for archiving, push it to disk, and forget about it. And again. And again. In this process, you never need more than one chunk of data in RAM. I don't know exactly how big tar's buffer is, but it is probably only a few megabytes, so a few megabytes of RAM is all you need to run it.
gpg and bzip2 work the same way: gpg takes a small chunk of data, encrypts it, and pushes it onward, all in a loop; bzip2 takes a small chunk of data, compresses it, and pushes it onward, all in a loop.
You will need slightly less RAM if you call these commands without a pipeline, because then you only keep one chunk of data in RAM at a time. With a pipeline, chunks of data are kept in RAM while they are passed from one command to another, so you need a few chunks at once: one for tar, one for gpg, one for bzip2. That is several megabytes at most. You never need enough RAM to hold all of the data at once.
With pipelining you actually get another benefit: speed. You don't need to store temporary data on disk; chunks of data are passed from tar to bzip2 and from bzip2 to gpg using RAM only. Compare: 3 GB of source data, a 3 GB tar file, a 1 GB tar.bz2 file, and a 1 GB tar.bz2.gpg file. Even if you don't store them all at the same time (e.g. by removing source files as you go), the step-by-step approach still writes and then reads back 4 GB of intermediate data (the tar and tar.bz2 files) that a pipeline never puts on disk. And VPSes are usually quite slow at disk operations, so a pipeline can spare you a lot of time.
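That ordering (compress before encrypt, so bzip2 sees compressible data) looks like this as one pipeline (the recipient is a placeholder):

    tar cf - myfiles/ | bzip2 | gpg -e -r you@example.com > output.tar.bz2.gpg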
(Note: this explanation is simplified. All of these commands keep some additional data in RAM, but it is rarely significant. Also, you can often change the buffer sizes of specific commands.)
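For example, GNU tar's blocking factor is one such knob: each record is 512 bytes, so the sketch below makes tar read and write in roughly 1 MiB blocks (recipient is a placeholder):

    # -b 2048 => 2048 x 512-byte records = 1 MiB per block
    tar -b 2048 -cf - myfiles/ | bzip2 | gpg -e -r you@example.com > output.tar.bz2.gpg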