curl / wget is adding an extra ^M when I append data to a file
Something is bugging me about this. I'm trying to download two different hosts files into one. If I do this separately then everything is fine, but when I append the first to the second, a strange character ^M
appears on each line of the hosts file.
To give a real example, here is what I'm doing:
wget https://raw.githubusercontent.com/StevenBlack/hosts/master/hosts -O /etc/hosts && curl -s "https://raw.githubusercontent.com/CHEF-KOCH/CKs-FilterList/master/HOSTS/CK's-Spotify-HOSTS-FilterList.txt" >> /etc/hosts
now /etc/hosts has these ^M characters, but when I do this separately, so
curl -s "https://raw.githubusercontent.com/CHEF-KOCH/CKs-FilterList/master/HOSTS/CK's-Spotify-HOSTS-FilterList.txt" > /tmp/hosts
now /tmp/hosts is perfectly normal.
Why is this happening? Why don't I get the wrong line ending when I download the files separately, yet when I combine them I do? It's supposed to be 0x0a, not 0x0d 0x0a. Why is this happening?
If you need to have a look at the files being downloaded you can head to the links in the commands:
- https://raw.githubusercontent.com/StevenBlack/hosts/master/hosts
- https://raw.githubusercontent.com/CHEF-KOCH/CKs-FilterList/master/HOSTS/CK%27s-Spotify-HOSTS-FilterList.txt
EDIT: I tried to append only the second hosts file to a dummy hosts file and the same thing happened, so we can rule out the first file as the cause of the problem.
Solution 1:
No tool is adding anything. It's quite confusing (but not your fault at all) for a few reasons.
There are two common line endings:
- Unix-style, one character, denoted `LF` (or `\n`, or `0x0a`),
- Windows-style, two characters, `CRLF` (or `\r\n`, or `0x0d 0x0a`).
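The difference is easy to see in raw bytes; a quick sketch using `printf` and `od` (any throwaway string works):

```shell
# Dump the bytes of a Unix-style vs. a Windows-style line ending
printf 'foo\n'   | od -An -tx1    # Unix:    66 6f 6f 0a
printf 'foo\r\n' | od -An -tx1    # Windows: 66 6f 6f 0d 0a
```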
You download from two different URLs. It seems the server claims each file is `text/plain`, so they should use `CRLF`. The second one (the one you `curl`) does indeed use `CRLF`, but the first one (the one you `wget`) illegally uses a sole `LF` instead.
If you download only from the first URL (no matter if with `wget` or `curl`) and store the result in a `hosts1` file, then `file hosts1` will yield:

```
hosts1: UTF-8 Unicode text
```

(This means the line endings are `LF`; otherwise it would be `UTF-8 Unicode text, with CRLF line terminators`.)
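You can reproduce this distinction locally without downloading anything; a minimal sketch (the file names are made up):

```shell
# Create one file with LF endings and one with CRLF endings,
# then let file(1) classify each of them
printf 'a\nb\n'     > /tmp/lf.txt
printf 'a\r\nb\r\n' > /tmp/crlf.txt
file /tmp/lf.txt     # → /tmp/lf.txt: ASCII text
file /tmp/crlf.txt   # → /tmp/crlf.txt: ASCII text, with CRLF line terminators
```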
If you download only from the second URL and store the result in a `hosts2` file, then `file hosts2` will yield:

```
hosts2: ASCII text, with CRLF line terminators
```
If you download both to the same file (say `hosts12`) in the way you do, you will get `LF` as the line ending for lines that came from the first URL and `CRLF` for lines that came from the second URL.
In practice, any tool that tries to tell whether a file uses `LF` or `CRLF` examines at most a few initial lines, not all of them. Try `file hosts12` and you'll get:

```
hosts12: UTF-8 Unicode text
```

exactly as it was for `hosts1`. The same happens when you `vim hosts12`: the editor detects the line endings as `LF` based on the beginning of the file. Then you skip to the end and see many `^M`s, which denote `CR` characters. `vim` prints them because it doesn't consider `CR` to be part of a proper line ending in this case.
However, when you `vim hosts2`, the editor correctly detects the line endings as `CRLF`. The same `CR` characters that were printed as `^M` earlier are now hidden from you, because `vim` considers them parts of proper line endings. If you added a new line by hand, `vim` would use the Windows-style line ending even though you're on Unix. You may think the file is "perfectly normal", but it's not a normal Unix text file.
The confusion arises because the two files on the server use different line endings, and then `vim` tries to be smart.
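The mixed-endings situation is easy to reproduce locally; a sketch that builds such a file and counts how many of its lines carry a stray `CR` (file names are made up):

```shell
# Two LF-terminated lines, then one appended CRLF-terminated line
printf 'unix one\nunix two\n' >  /tmp/hosts12
printf 'windows one\r\n'      >> /tmp/hosts12
# Count the lines containing a literal carriage return
grep -c "$(printf '\r')" /tmp/hosts12   # → 1  (only the appended line)
```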
In Linux (Unix in general) you want your `/etc/hosts` to use `LF` as its line endings. See the POSIX definitions of line and newline character. It's explicitly stated the character is `\n`:

> 3.243 Newline Character (`<newline>`)
> A character that in the output stream indicates that printing should start at the beginning of the next line. It is the character designated by `'\n'` in the C language.
I don't think tools are then obligated to support `\r\n`. The simple solution is to run `wget … && curl … >> …` exactly as you did, then invoke `dos2unix /etc/hosts`.
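If `dos2unix` happens to be unavailable, deleting the `CR` bytes with `tr` gives the same result; a sketch on a throwaway file rather than the real `/etc/hosts`:

```shell
# Strip carriage returns the way dos2unix would
printf 'a\r\nb\r\n' > /tmp/hosts.crlf
tr -d '\r' < /tmp/hosts.crlf > /tmp/hosts.lf
od -An -tc /tmp/hosts.lf    # only a, \n, b, \n remain — no \r left
```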
If I were you, I would work with another file, say `/etc/hosts.tmp`. I would use `wget`, `curl`, `dos2unix`, `chmod --reference=/etc/hosts`, `chown --reference=/etc/hosts`. Only when the file is complete would I `mv` it to replace `/etc/hosts`. This feature of `rename(2)` is relevant:
> If *newpath* already exists, it will be atomically replaced, so that there is no point at which another process attempting to access *newpath* will find it missing.
So any process would find either the old `/etc/hosts` (before `mv`) or the new one (after `mv`). Your current approach, directly working with `/etc/hosts`, allows scenarios where another process finds the file incomplete or with the wrong line endings near its end.
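Putting the suggestion together: a sketch of the tmp-file-then-rename workflow, demonstrated with local stand-in files. In real use you would replace the `printf` stand-in with the `wget`/`curl` downloads, use `/etc/hosts` instead of the demo path, and run it as root.

```shell
set -e
target=/tmp/hosts.demo
printf 'old contents\n' > "$target"         # pre-existing hosts file
tmp="$target.tmp"
printf '0.0.0.0 ads.example\r\n' > "$tmp"   # stand-in for the downloads
tr -d '\r' < "$tmp" > "$tmp.lf"             # dos2unix equivalent
mv "$tmp.lf" "$tmp"
chmod --reference="$target" "$tmp"          # copy the old file's mode
chown --reference="$target" "$tmp"          # ...and its owner
mv "$tmp" "$target"                         # atomic replace via rename(2)
cat "$target"                               # → 0.0.0.0 ads.example
```

Readers of the file see either the complete old version or the complete new one, never a half-written mixture.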