Create Files with Different Character Encodings from Z-Shell
I'm trying to understand character encoding better. To experiment, I want to take a string of characters and encode it in different ways. Does the zsh prompt have a way to create files that use specific character encodings? For example, I'm trying to create files that use the following character encodings:
- ASCII
- Unicode
- UTF32
- UTF8
I'd like to see the same string of characters encoded in each of these encodings to compare and contrast them. Thank you.
As a shell, zsh mostly doesn't deal with such things directly, but you can do it by running other programs through zsh. (Admittedly, zsh is much richer in built-in functions than most other shells, but character encoding conversion doesn't seem to be one of them.)
To convert character encodings within the shell, the iconv tool is usually used: the -f option specifies the encoding to convert from, -t specifies the target encoding, and data is read from stdin. For example:
echo "Here are some arrows 🠈, 🠊, 🠉, 🠋" > text_in_utf8.txt
iconv -f utf-8 -t utf-16 < text_in_utf8.txt > text_in_utf16.txt
(Usually the system locale is set to UTF-8, so anything you type directly into the shell, e.g. the echo in the above example, will also end up being UTF-8. But make sure to check it using locale charmap.)
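If iconv isn't available, the same decode-then-encode step can be sketched in Python 3. This is only a rough equivalent of the iconv call above, not how iconv itself works; the helper name convert_file and the temporary directory are my own choices for illustration.

```python
import os
import tempfile


def convert_file(src_path, dst_path, from_enc, to_enc):
    """Re-encode a text file, roughly like `iconv -f from_enc -t to_enc`.

    Hypothetical helper for illustration; returns the number of bytes written.
    """
    # Read the raw bytes and interpret them using the source encoding.
    with open(src_path, "rb") as src:
        text = src.read().decode(from_enc)
    # Encode the resulting string in the target encoding and write the bytes.
    data = text.encode(to_enc)
    with open(dst_path, "wb") as dst:
        dst.write(data)
    return len(data)


# Work in a scratch directory so nothing in the current directory is touched.
workdir = tempfile.mkdtemp()
utf8_path = os.path.join(workdir, "text_in_utf8.txt")
utf16_path = os.path.join(workdir, "text_in_utf16.txt")

# Write the sample text explicitly as UTF-8 bytes, independent of the locale.
with open(utf8_path, "wb") as f:
    f.write("Here are some arrows 🠈, 🠊, 🠉, 🠋\n".encode("utf-8"))

utf16_size = convert_file(utf8_path, utf16_path, "utf-8", "utf-16")
print(os.path.getsize(utf8_path), "bytes as UTF-8")
print(utf16_size, "bytes as UTF-16")
```

Note that Python's "utf-16" codec writes a byte order mark at the start of the output, so the UTF-16 file is two bytes longer than the encoded characters alone would be.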
Other character encodings such as iso8859-1 or ibm437 are also available; see iconv -l for a list. (Note that "Unicode" is just the abstract character set, not an encoding in itself. When you see "Unicode" as an encoding in Windows, that actually means UTF-16, or UCS-2 in very old versions.)
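To compare the encodings from the question side by side, something like the following Python 3 sketch may help. It treats "Unicode" as UTF-16, per the note above, and the helper name encode_all is made up for this example. ASCII has no code for characters such as the arrows, so that encoding is reported as failing rather than producing bytes.

```python
def encode_all(text, encodings=("ascii", "utf-8", "utf-16", "utf-32")):
    """Return {encoding: bytes or None}; None means the text cannot be encoded.

    Hypothetical helper for illustration only.
    """
    results = {}
    for encoding in encodings:
        try:
            results[encoding] = text.encode(encoding)
        except UnicodeEncodeError:
            # The target encoding has no representation for some character.
            results[encoding] = None
    return results


for sample in ("Arrows", "Arrows 🠈 🠊"):
    print(repr(sample))
    for encoding, data in encode_all(sample).items():
        if data is None:
            print(f"  {encoding:>6}: cannot be encoded")
        else:
            hex_dump = " ".join(f"{b:02x}" for b in data)
            print(f"  {encoding:>6}: {len(data):3d} bytes  {hex_dump}")
```

For pure ASCII text, the ASCII and UTF-8 results are byte-for-byte identical, which is a nice way to see that UTF-8 is a superset of ASCII.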
There are also other ways to experiment with character encodings. For example, in Python you can .encode() a Unicode string into bytes using a specific encoding, or .decode() bytes back to a Unicode string:
$ python3
>>> text = "🠈 🠊 🠉 🠋"
>>> utf8bytes = text.encode("utf-8")
>>> utf16bytes = text.encode("utf-16")
>>> ["%02x" % b for b in utf8bytes]
['f0', '9f', 'a0', '88', '20', 'f0', '9f', 'a0', '8a', '20',
'f0', '9f', 'a0', '89', '20', 'f0', '9f', 'a0', '8b']
(In this case, I do mean "Unicode string" in the abstract sense, unlike the earlier note.)
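The session above only prints the UTF-8 bytes. As a follow-up sketch (assuming Python 3; the hex_bytes helper is just a name I picked), here is how the UTF-16 side and the .decode() direction might be explored, including the byte order mark that the plain "utf-16" codec prepends:

```python
import codecs

text = "🠈 🠊 🠉 🠋"


def hex_bytes(data):
    """Format bytes as a list of two-digit hex strings, as in the session above."""
    return ["%02x" % b for b in data]


# Explicit byte orders produce no byte order mark (BOM).
utf16le = text.encode("utf-16-le")
utf16be = text.encode("utf-16-be")
# The generic codec prepends a BOM and uses the machine's native byte order.
utf16 = text.encode("utf-16")

# Each arrow lies outside the Basic Multilingual Plane, so UTF-16 stores it
# as a surrogate pair (two 16-bit units, four bytes).
print("first arrow, little-endian:", hex_bytes(utf16le[:4]))
print("first arrow, big-endian:   ", hex_bytes(utf16be[:4]))

has_bom = utf16.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
print("generic utf-16 starts with a BOM:", has_bom)

# Decoding with the matching codec gives back the original string.
print("round trip ok:", utf16.decode("utf-16") == text)
```

Decoding with the wrong codec (for example, treating the UTF-16 bytes as UTF-8) either raises UnicodeDecodeError or yields garbage, which is worth trying as well.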