bash zcat head causes pipefail?
set -eu
VAR=$(zcat file.gz | head -n 12)
works fine
set -eu -o pipefail
VAR=$(zcat file.gz | head -n 12)
causes bash to exit with failure. How is this causing a pipefail?
Note that file.gz contains millions of lines (~ 750 MB, compressed).
Think about it, for a moment.
- You're telling the shell that your entire pipeline should be considered to have failed if any component failed.
- You're telling zcat to write its output to head.
- Then you're telling head to exit after reading 12 lines, out of a much-longer-than-12-line input stream.
Of course you have an error: zcat has its destination pipeline closed early, and wasn't able to successfully write a decompressed version of your input file! It doesn't have any way of knowing whether this was due to user intent or to something erroneous happening.
If you were using zcat to write to a disk and it ran out of space, or to a network stream and there was a connection loss, it would be entirely correct and appropriate for it to exit with a status indicating a failure. This is simply another case of that rule.
The specific error which zcat is being given by the operating system is EPIPE, returned by the write syscall under the following condition: An attempt is made to write to a pipe that is not open for reading by any process.
After head (the only reader of this FIFO) has exited, for any write to the input side of the pipe not to return EPIPE would be a bug. For zcat to silently ignore an error writing its output, and thus be able to generate an inaccurate output stream without an exit status reflecting this event, would likewise be a bug.
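To see which stage actually failed, bash records the exit status of each pipeline component in its PIPESTATUS array. The sketch below is only an illustration: yes stands in for zcat on a large file (an assumption, so that the example needs no input file), and the variable name statuses is arbitrary. With default signal handling the writer is normally killed by SIGPIPE, so its status usually shows up as 141 (128 + signal 13); the exact value may differ in environments where SIGPIPE is ignored.

```bash
#!/usr/bin/env bash
# Illustration only: `yes` is a stand-in for `zcat file.gz` (an endless writer).

yes | head -n 1 >/dev/null

# Copy PIPESTATUS right away; the next command would overwrite it.
statuses=("${PIPESTATUS[@]}")

echo "writer (yes) exit status:  ${statuses[0]}"   # non-zero: its reader went away
echo "reader (head) exit status: ${statuses[1]}"   # zero: head did what it was asked
```

With pipefail enabled, the non-zero writer status becomes the status of the whole pipeline, which is what set -e then acts on.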
If you don't want to change any of your shell options, by the way, one workaround you might consider is using process substitution:
var=$(head -n 12 < <(zcat file.gz))
In this case, zcat is not a pipeline component, and its exit status is not considered for purposes of determining success. (You might test whether $var is 12 lines long, if you want to come up with an independent success/fail determination.)
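A minimal sketch of that workaround together with the suggested line-count check follows. The sample data, the temporary file, and the names tmp and line_count are assumptions made for the sake of a self-contained example; gzip -dc is used in place of zcat because it behaves the same way and is available on more systems.

```bash
#!/usr/bin/env bash
set -eu -o pipefail

# Hypothetical sample data standing in for file.gz.
tmp=$(mktemp)
trap 'rm -f "$tmp"' EXIT
seq 1 1000000 | gzip -c > "$tmp"

# The decompressor runs in a process substitution, so its exit status
# is not part of any pipeline and pipefail never sees it.
var=$(head -n 12 < <(gzip -dc "$tmp"))

# Independent success check: did we really get 12 lines?
line_count=$(printf '%s\n' "$var" | wc -l | tr -d ' ')
if [ "$line_count" -ne 12 ]; then
  echo "expected 12 lines, got $line_count" >&2
  exit 1
fi
echo "got $line_count lines"
```

The trade-off is that a genuine decompression failure is no longer reported through the exit status, which is why the explicit line-count test is worth keeping.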
A more comprehensive solution could be implemented by pulling in a Python interpreter, with its native gzip support. A native Python implementation (compatible with both Python 2 and 3.x), embedded in a shell script, might look something like:
zhead_py=$(cat <<'EOF'
import sys, gzip
# Open the gzip file named by the first argument for binary reading
gzf = gzip.GzipFile(sys.argv[1], 'rb')
# Python 3 needs the binary buffer; Python 2 stdout already accepts bytes
outFile = sys.stdout.buffer if hasattr(sys.stdout, 'buffer') else sys.stdout
numLines = 0
maxLines = int(sys.argv[2])
for line in gzf:
    if numLines >= maxLines:
        sys.exit(0)
    outFile.write(line)
    numLines += 1
EOF
)
zhead() { python -c "$zhead_py" "$@"; }
...which gets you a zhead that doesn't fail if it runs out of input data, but does pass through a failed exit status for genuine I/O failures or other unexpected events. (Usage is of the form zhead in.gz 5, to read 5 lines from in.gz.)
Alternatively, you can use
zcat file.gz | awk '(NR<=12)'
The price is that awk has to read all of zcat's output: there is no early stop once the lines you asked for have been printed, so the whole file gets decompressed.
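As a rough check that this variant survives set -eu -o pipefail, here is a small sketch. The sample file and the names sample, first12 and awk_lines are assumptions for illustration, and gzip -dc again stands in for zcat.

```bash
#!/usr/bin/env bash
set -eu -o pipefail

# Hypothetical sample data standing in for file.gz.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
seq 1 100000 | gzip -c > "$sample"

# awk keeps reading until end of input, so the decompressor never
# writes to a closed pipe and both stages exit with status 0.
first12=$(gzip -dc "$sample" | awk 'NR <= 12')

awk_lines=$(printf '%s\n' "$first12" | wc -l | tr -d ' ')
echo "awk kept $awk_lines lines"
```

Adding an early exit to the awk program would bring back the original problem, since the decompressor would once again lose its reader.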