Head with a weird behavior

I downloaded a WARC file from Common Crawl on Ubuntu 18.04. After decompressing it with gzip, I tried to get a segment of the file using head. I first tried:

head -c 29 CC-MAIN-20210620114611-20210620144611-00436.warc

It produced the expected result, outputting the first 29 bytes of the file:

WARC/1.0
WARC-Type: warcinfo

But, if instead of 29, I use 30, it produces a result I was not expecting:

head -c 30 CC-MAIN-20210620114611-20210620144611-00436.warc

Output:

WARC/1.0

This is only the first 10 bytes of the file, not the first 30. If I use head -c 31, the result is as expected again. I have no idea whether this is a bug or whether there is some detail of how head works that I'm not aware of.


The head command is almost certainly outputting the requested number of bytes. What those bytes are, however, affects how they are displayed in your terminal.
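One way to check that claim is to count the bytes head writes instead of looking at them on the terminal. The following is a minimal sketch, not a test against the real file: it uses a temporary stand-in file (a hypothetical substitute for the Common Crawl file) that contains only the two header lines from the question, written with CRLF line endings as an assumption.

```shell
# Hypothetical stand-in for the WARC file: the two header lines from the
# question, written with CRLF line endings (an assumption about the real file).
sample=$(mktemp)
printf 'WARC/1.0\r\nWARC-Type: warcinfo\r\n' > "$sample"

# Count what head emits for each length; wc -c reports the number of bytes
# that came through the pipe, regardless of how a terminal would draw them.
for n in 29 30 31; do
    count=$(head -c "$n" "$sample" | wc -c | tr -d ' ')
    echo "head -c $n wrote $count bytes"
done

rm -f "$sample"
```

If the counts match the requested lengths on the real file as well, head is doing its job and the difference lies in how the terminal renders the bytes.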

Specifically, your gunzipped file almost certainly has DOS-style CRLF line endings, with a CR at byte 30 and LF at byte 31. When you do head -c29, the head output excludes both line ending bytes, and you see something like

yourname@computer:~$ head -c29 file.warc
WARC/1.0
WARC-Type: warcinfoyourname@computer:~$

with your shell prompt following directly after the 29th byte. When you do head -c31, you capture both the CR and the LF, and the output looks like

yourname@computer:~$ head -c31 file.warc
WARC/1.0
WARC-Type: warcinfo
yourname@computer:~$

However, when you do head -c30, the output contains the terminating CR but not its following LF - the cursor is sent back to column 0 but stays on the same line of the terminal, so your shell prompt is then printed over the text of that line:

yourname@computer:~$ head -c30 file.warc
WARC/1.0
yourname@computer:~$

If the line were longer than your prompt, you would see characters from the file peeking out beyond the end of the prompt. If your PS1 prompt were empty, you would have seen the full expected output.
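To see the trailing CR directly, you can dump the bytes instead of letting the terminal interpret them. The sketch below again uses a temporary stand-in file (hypothetical, built from the two header lines in the question with CRLF endings assumed) so that it runs on its own. od -c prints control characters as escapes such as \r and \n, and tr -d '\r' removes the CRs if you only want readable output.

```shell
# Hypothetical stand-in for the WARC file, with CRLF line endings assumed.
sample=$(mktemp)
printf 'WARC/1.0\r\nWARC-Type: warcinfo\r\n' > "$sample"

# Dump the first 30 bytes one character at a time; a CR shows up as \r
# instead of moving the cursor.
head -c 30 "$sample" | od -c

# Strip the CR bytes before display so the prompt cannot overwrite the line.
visible=$(head -c 30 "$sample" | tr -d '\r')
printf '%s\n' "$visible"

rm -f "$sample"
```

With the real file, replace "$sample" with the path to the decompressed .warc file. Note that tr -d '\r' changes the byte stream, so use it only for viewing, not for extracting byte-exact segments.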