Filling file with 0xFF gives C3BF in OSX
Solution 1:
Straight to the point.
It all hinges on the LANG or LC_ALL value set in your terminal session when you run tr. On Linux they are probably set to C, while macOS sets them to something like en_US.UTF-8. Of course that en_US could be some other language and region, such as en_GB (UK English), but the point is that the [something].UTF-8 setting, instead of the plain byte-oriented C locale, is what is causing this.
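As a rough sketch of which variable wins: for character handling, the lookup order is LC_ALL, then LC_CTYPE, then LANG. The helper name effective_ctype is made up for this illustration, and the final fallback to C is an assumption (what happens when nothing is set is up to the implementation).

```shell
# Hypothetical helper: print the locale name that applies to character
# handling, following the LC_ALL > LC_CTYPE > LANG order.
# Empty values count as unset; falling back to "C" is an assumption.
effective_ctype() {
  printf '%s\n' "${LC_ALL:-${LC_CTYPE:-${LANG:-C}}}"
}

effective_ctype
```

If this prints something ending in .UTF-8, a multibyte-aware tr will treat \377 as the character U+00FF rather than as a single byte. The locale command shows the same settings in more detail.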
More details.
It seems that tr in macOS converts the 0xff to its UTF-8 encoding, c3 bf, instead of writing the raw byte 0xff. This is explained in this Apple community support thread:
Linux doesn't handle Unicode in the Terminal like the Mac does. If you set the "LANG" environment variable to "C" (as it probably is on Linux), it will work. Otherwise, all those high-order bits are going to get interpreted as Unicode characters.
And using that LANG tip works! Just do the following; I tested it on macOS 10.13.6 (High Sierra).
First, make a note of the existing LANG value like this:
echo $LANG
The output I see is:
en_US.UTF-8
Now set the LANG value to C like this:
LANG=C
And run that command again:
dd if=/dev/zero ibs=1k count=100 | tr "\000" "\377" >paddedFile.bin
Now the hexdump values should look like this:
hexdump -C paddedFile.bin
00000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|
*
00019000
To reset the LANG value, either close that terminal session or run this command (using whatever value echo $LANG printed earlier):
LANG=en_US.UTF-8
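If you would rather not have to remember the old value, one option is to confine the change to a subshell. This is only a sketch: it uses LC_ALL rather than LANG (an assumption on my part, chosen because LC_ALL takes precedence over the other locale variables), it converts just four bytes to keep the output short, and the variable names are illustrative.

```shell
lc_all_before=${LC_ALL-unset}

first_bytes=$(
  # The export only lives inside this command substitution (a subshell).
  export LC_ALL=C
  dd if=/dev/zero bs=1 count=4 2>/dev/null | tr '\000' '\377' | od -An -tx1 | tr -d ' \n'
)

lc_all_after=${LC_ALL-unset}

echo "$first_bytes"                       # ffffffff
echo "$lc_all_before -> $lc_all_after"    # same value on both sides
```

The locale setting of the outer shell is untouched, so there is nothing to reset afterwards.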
Or, as pointed out in the comments, you can just set the LANG value on the command line right before calling tr like this:
dd if=/dev/zero ibs=1k count=100 | LANG=C tr "\000" "\377" >paddedFile.bin
And you can even use LC_ALL instead of LANG, because LC_ALL overrides LANG and every individual LC_* category, like this:
dd if=/dev/zero ibs=1k count=100 | LC_ALL=C tr "\000" "\377" >paddedFile.bin
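To check the result without reading a hexdump, one option is to count bytes. The sketch below writes to a temporary file instead of paddedFile.bin, and the variable names are only for illustration.

```shell
out=$(mktemp "${TMPDIR:-/tmp}/padded.XXXXXX")

# Same pipeline as above, with dd's status output silenced.
dd if=/dev/zero ibs=1k count=100 2>/dev/null | LC_ALL=C tr "\000" "\377" > "$out"

# Total size in bytes (tr -d ' ' strips the padding some wc versions add).
total=$(wc -c < "$out" | tr -d ' ')

# Delete every 0xff byte and count what is left; 0 means the file is all 0xff.
other=$(LC_ALL=C tr -d "\377" < "$out" | wc -c | tr -d ' ')

echo "total=$total other=$other"    # total=102400 other=0
rm -f "$out"
```

A file that was written as c3 bf pairs would instead be twice as large (204800 bytes), and none of its bytes would be 0xff.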
Solution 2:
The issue is that GNU tr, which you have on Linux, doesn't really have a concept of multibyte characters, but instead works one byte at a time.
The tr man page and online documentation speak of characters, but that's a bit of a simplification. The TODO file in the source code package mentions this item (picked from coreutils 8.30):
Adapt tools like wc, tr, fmt, etc. (most of the textutils) to be multibyte aware. The problem is that I want to avoid duplicating significant blocks of logic, yet I also want to incur only minimal (preferably 'no') cost when operating in single-byte mode.
On a Linux system, even with a UTF-8 locale (en_US.UTF-8), GNU tr treats an ä as two "characters" and replaces each of them (the UTF-8 representation of ä has two bytes):
linux$ echo 'ä' | tr 'ä' 'x'
xx
In the same vein, mixing an ä and an ö produces funny results, since their UTF-8 representations share a common byte:
linux$ echo 'ö' | tr ä x
x�
Or the other way around (the x is never used here, since a and b map to the two bytes of ä):
linux$ echo ab | tr ab äx
ä
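The byte-level view behind these three examples can be sketched as follows. The octal escapes spell out the UTF-8 bytes explicitly (ä is c3 a4, ö is c3 b6) and LC_ALL=C forces byte-wise handling, so this should behave the same with either tr; hex is a hypothetical helper defined only for this sketch.

```shell
# Hypothetical helper: print stdin as a compact hex string.
hex() { od -An -tx1 | tr -d ' \n'; }

a_umlaut=$(printf '\303\244' | hex)   # UTF-8 bytes of ä
o_umlaut=$(printf '\303\266' | hex)   # UTF-8 bytes of ö

# Byte-wise, both bytes of ä (c3 and a4) map to x, so ö (c3 b6)
# becomes x followed by a stray b6, which is not valid UTF-8.
mixed=$(printf '\303\266' | LC_ALL=C tr '\303\244' 'x' | hex)

# Byte-wise, a -> c3 and b -> a4; the trailing x in the second set is unused.
swapped=$(printf 'ab' | LC_ALL=C tr 'ab' '\303\244x' | hex)

echo "$a_umlaut $o_umlaut $mixed $swapped"   # c3a4 c3b6 78b6 c3a4
```

The shared c3 byte is why ö is half-translated, and the leftover b6 is what the terminal shows as the replacement character.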
And in your case, GNU tr takes the \377 as a raw byte value.
The tr on Mac is different. It understands multibyte characters and acts accordingly:
mac$ echo 'ä' | tr ä x
x
mac$ echo ab | tr ab äx
äx
The UTF-8 representation of the character with numerical value 0377 (U+00ff) is the two bytes c3 bf, so that's what you get.
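For the curious, the two bytes can be worked out with shell arithmetic. This is a sketch of the two-byte UTF-8 form only, which covers code points U+0080 through U+07FF; the variable names are illustrative.

```shell
cp=$((0377))                      # octal 377 = decimal 255 = U+00FF
lead=$(( 0xC0 | (cp >> 6) ))      # 110xxxxx: the bits above the low six
cont=$(( 0x80 | (cp & 0x3F) ))    # 10xxxxxx: the low six bits
encoded=$(printf '%02x %02x' "$lead" "$cont")
echo "$encoded"                   # c3 bf
```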
The easy way to have tr work byte-by-byte is to have it use the C locale, instead of a UTF-8 locale. This gives the funny behavior again:
$ echo 'ä' | LC_ALL=C tr 'ä' 'x'
xx
And in your case, you can use:
... | LC_ALL=C tr "\000" "\377"
Or you could use something like Perl to generate those \xff bytes directly (100 blocks of 1024 bytes, the same size as the dd example):
perl -e 'print "\377" x 1024 for 1..100'