curl / wget is adding an extra ^M when I append data to a file

Something is bugging me about this. I'm trying to download two different hosts files into one. If I do this separately then everything is fine, but when I append the second to the first, a strange character ^M appears at the end of each line of the hosts file.

To give a real example, here is what I'm doing:

wget https://raw.githubusercontent.com/StevenBlack/hosts/master/hosts -O /etc/hosts && curl -s "https://raw.githubusercontent.com/CHEF-KOCH/CKs-FilterList/master/HOSTS/CK's-Spotify-HOSTS-FilterList.txt" >> /etc/hosts

now /etc/hosts looks like this: [screenshot: the file with ^M at the end of each line]

but when I do this separately, like so:

curl -s "https://raw.githubusercontent.com/CHEF-KOCH/CKs-FilterList/master/HOSTS/CK's-Spotify-HOSTS-FilterList.txt" > /tmp/hosts

now /tmp/hosts is perfectly normal

Why is this happening? Why don't I get the wrong line endings when I download the files separately, yet I do when I combine them? It's supposed to be 0x0a, not 0x0d 0x0a.

If you need to have a look at the files being downloaded, you can head to the links used in the commands:

  1. https://raw.githubusercontent.com/StevenBlack/hosts/master/hosts
  2. https://raw.githubusercontent.com/CHEF-KOCH/CKs-FilterList/master/HOSTS/CK%27s-Spotify-HOSTS-FilterList.txt

EDIT: I tried appending only the second hosts file to a dummy hosts file and the same thing happened, so we can rule out the first file as the cause of the problem.
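A minimal way to reproduce that test (the /tmp/dummy path is just for illustration):

printf '0.0.0.0 example.test\n' > /tmp/dummy
curl -s "https://raw.githubusercontent.com/CHEF-KOCH/CKs-FilterList/master/HOSTS/CK's-Spotify-HOSTS-FilterList.txt" >> /tmp/dummy
tail -n 1 /tmp/dummy | od -c   # the last two bytes are \r \n, straight from the download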


Solution 1:

No tool is adding anything. This is quite confusing (but not your fault at all) for a few reasons.

There are two common line endings:

  • Unix-style, one character denoted LF (or \n or 0x0a),
  • Windows-style, two characters, CRLF (or \r\n or 0x0d 0x0a).
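If it helps, here is a minimal byte-level demonstration (printf just fabricates one line of each kind):

printf 'Unix line\n' | od -c        # ends in  \n
printf 'Windows line\r\n' | od -c   # ends in  \r  \n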

You download from two different URLs. The server seems to claim each file is text/plain, so strictly they should use CRLF. The second one (the one you curl) does indeed use CRLF, but the first one (the one you wget) illegally uses a bare LF instead.

If you download only from the first URL (no matter if with wget or curl) and store the result in a hosts1 file, then file hosts1 will yield:

hosts1: UTF-8 Unicode text

(This means the line endings are LF; otherwise it would say UTF-8 Unicode text, with CRLF line terminators.)

If you download only from the second URL and store the result in a hosts2 file, then file hosts2 will yield:

hosts2: ASCII text, with CRLF line terminators

If you download both to the same file (say hosts12) in the way you do, you will get LF as line endings for lines that came from the first URL and CRLF as line endings for lines that came from the second URL.
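You can verify the mix yourself; in bash, $'\r' expands to a CR character, so (assuming hosts12 from above):

grep -c $'\r$' hosts12    # counts lines ending in CRLF (from the second URL)
grep -cv $'\r$' hosts12   # counts lines ending in bare LF (from the first URL)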

In practice, any tool that tries to tell whether a file uses LF or CRLF examines at most a few initial lines, not all of them. Try file hosts12 and you'll get:

hosts12: UTF-8 Unicode text

exactly as it was for hosts1. The same happens when you vim hosts12: the editor detects the line endings as LF based on the beginning of the file. Then you skip to the end and see many ^M markers, which denote CR characters. vim prints them because it doesn't consider CR part of a proper line ending in this case.

However, when you vim hosts2, the editor correctly detects the line endings as CRLF. The same CR characters that were printed as ^M earlier are now hidden from you, because vim considers them parts of proper line endings. If you added a new line by hand, vim would use the Windows-style line ending even though you're on Unix. You may think the file is "perfectly normal", but it's not a normal Unix text file.
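You can ask vim what it decided; these are standard vim commands:

:set fileformat?   " shows fileformat=unix or fileformat=dos
:e ++ff=unix       " reload treating the file as Unix; every CR shows up as ^M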

The confusion is because the two files on the server use different line endings; then vim tries to be smart.

In Linux (and Unix in general) you want your /etc/hosts to use LF line endings. See the POSIX definitions of line and newline character; it's explicitly stated the character is \n:

3.243 Newline Character (<newline>)
A character that in the output stream indicates that printing should start at the beginning of the next line. It is the character designated by '\n' in the C language.

I don't think tools are obligated to support \r\n, then. The simple solution is to run wget … && curl … >> … exactly as you did, and then invoke dos2unix /etc/hosts.
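Put together, that could look as follows (dos2unix edits the file in place; if it's unavailable, the sed line sketched in the comment does the same with GNU sed):

wget https://raw.githubusercontent.com/StevenBlack/hosts/master/hosts -O /etc/hosts \
  && curl -s "https://raw.githubusercontent.com/CHEF-KOCH/CKs-FilterList/master/HOSTS/CK's-Spotify-HOSTS-FilterList.txt" >> /etc/hosts \
  && dos2unix /etc/hosts
# alternative without dos2unix: sed -i 's/\r$//' /etc/hosts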

If I were you, I would work with another file, say /etc/hosts.tmp. I would run wget, curl, dos2unix, chmod --reference=/etc/hosts and chown --reference=/etc/hosts against it. Only when the file is complete would I mv it into place as /etc/hosts. This feature of rename(2) is relevant:

If newpath already exists, it will be atomically replaced, so that there is no point at which another process attempting to access newpath will find it missing.

So any process would find either the old /etc/hosts (before the mv) or the new one (after the mv). Your current approach of working directly on /etc/hosts allows scenarios where another process finds the file incomplete, or with wrong line endings near its end.
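A sketch of that safer workflow (the .tmp name and the minimal sh -e error handling are just illustrative choices):

#!/bin/sh -e
tmp=/etc/hosts.tmp
wget -q https://raw.githubusercontent.com/StevenBlack/hosts/master/hosts -O "$tmp"
curl -s "https://raw.githubusercontent.com/CHEF-KOCH/CKs-FilterList/master/HOSTS/CK's-Spotify-HOSTS-FilterList.txt" >> "$tmp"
dos2unix "$tmp"                      # normalize all line endings to LF
chmod --reference=/etc/hosts "$tmp"  # copy permissions from the live file
chown --reference=/etc/hosts "$tmp"  # copy owner and group from the live file
mv "$tmp" /etc/hosts                 # atomic replacement via rename(2)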