Is there a difference between en_US.utf8 and en_US.UTF-8?
Solution 1:
TL;DR:
The codepage / character set .utf8 in en_US.utf8 is not officially recognised, as far as I can tell. There is no IANA character set name utf8. The utf8 spelling is likely generated by glibc (see the final heading).
The IANA character set name is UTF-8.
- The hyphen is important
- The name is case-insensitive
Therefore, these are all valid:
en_US.utf-8
en_US.UTF-8
en_US.uTf-8
There is also a case-sensitive alias for the name UTF-8, namely csUTF8.
Therefore, this would also be valid:
en_US.csUTF8
But I have never seen this in the wild.
The details, with chapter and verse
UTF-8 is a valid IANA character set name, whereas utf8 is not. It's not even a valid alias.
POSIX.1-2017, section 8.2 Internationalization Variables says:
If the locale value has the form:
language[_territory][.codeset]
it refers to an implementation-provided locale, where settings of language, territory, and codeset are implementation-defined.
Here the part in question is the [.codeset] part, which POSIX doesn't define, but IANA does.
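To make the language[_territory][.codeset] form concrete, here is a minimal sketch that pulls the codeset part out of a locale name. The helper name locale_codeset is hypothetical, and stopping at an optional @modifier suffix (as in de_DE.UTF-8@euro) is an assumption that goes beyond the text quoted above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of any library): copies the codeset part of
   a locale name of the form language[_territory][.codeset] into out.
   Returns 1 if a non-empty codeset was found and fits, 0 otherwise. */
static int locale_codeset(const char *locale_name, char *out, size_t out_size)
{
    const char *dot = strchr(locale_name, '.');
    if (dot == NULL)
        return 0;                       /* no .codeset part at all */

    /* Assumption: an optional @modifier may follow the codeset; stop there. */
    size_t len = strcspn(dot + 1, "@");
    if (len == 0 || len >= out_size)
        return 0;                       /* empty codeset or buffer too small */

    memcpy(out, dot + 1, len);
    out[len] = '\0';
    return 1;
}
```

Called with "en_US.UTF-8" this yields UTF-8, and with "en_US.utf8" it yields utf8; a plain "en_US" has no codeset part, so the function returns 0.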
For the character set defined by RFC3629: UTF-8, a transformation format of ISO 10646, the IANA Character Sets registry lists the name as:
UTF-8
and the note at the top says:
These are the official names for character sets that may be used in the Internet and may be referred to in Internet documentation.
An alias csUTF8 is provided, about which RFC2978 IANA Charset Registration Procedures, section 2.3 says:
All other names are considered to be aliases for the primary name and use of the primary name is preferred over use of any of the aliases.
IANA Character Sets also says:
The "cs" stands for character set and is provided for applications that need a lower case first letter but want to use mixed case thereafter that cannot contain any special characters, such as underbar ("_") and dash ("-").
In the cs alias, the case is significant (while the name itself is case-insensitive, as noted above).
Given the alias csUTF8, en_US.csUTF8 would also be valid, but I have never seen this format in the wild.
While case matters in aliases, regarding names, IANA Character Sets says:
The character set names may be up to 40 characters taken from the printable characters of US-ASCII. However, no distinction is made between use of upper and lower case letters.
So while en_US.utf-8 is valid (a lowercase version of the listed UTF-8), en_US.utf8 doesn't refer to an IANA character set, as it drops the hyphen.
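The matching rules described here can be sketched as a small check. This is only an illustration of the reading given in this answer (name compared case-insensitively, the csUTF8 alias compared exactly); the function names are hypothetical and nothing here is an official IANA or glibc API.

```c
#include <ctype.h>
#include <string.h>

/* Case-insensitive string equality, relying on tolower() in the default
   "C" locale, where only ASCII letters are affected. */
static int equals_ignore_case(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        ++a;
        ++b;
    }
    return *a == *b;    /* equal only if both strings ended together */
}

/* Hypothetical check: returns 1 if codeset is the IANA name UTF-8 in any
   letter case, or the alias csUTF8 spelled exactly (assumption: the alias
   is treated as case-sensitive, following this answer). */
static int is_iana_utf8(const char *codeset)
{
    return equals_ignore_case(codeset, "UTF-8")
        || strcmp(codeset, "csUTF8") == 0;
}
```

Under these assumptions utf-8, UTF-8, uTf-8 and csUTF8 are accepted, while utf8 is rejected because the hyphen is missing.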
If it's not IANA, where does utf8 likely come from?
glibc's _nl_normalize_codeset() does the following:
- Only passes letters and digits through (goodbye hyphen)
- Converts letters to lowercase
for (cnt = 0; cnt < name_len; ++cnt)
  if (__isalpha_l ((unsigned char) codeset[cnt], locale))
    *wp++ = __tolower_l ((unsigned char) codeset[cnt], locale);
  else if (__isdigit_l ((unsigned char) codeset[cnt], locale))
    *wp++ = codeset[cnt];
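To see the effect of that loop outside glibc, here is a simplified standalone sketch. It uses the standard isalpha(), tolower() and isdigit() instead of glibc's internal locale-aware variants, writes into a caller-supplied buffer, and leaves out anything else the real function may do; normalize_codeset_sketch is a hypothetical name.

```c
#include <ctype.h>
#include <stddef.h>

/* Simplified sketch of the filtering loop above (not glibc's actual
   function): keeps letters (lowercased) and digits, drops everything else.
   Output is truncated if out_size is too small. */
static void normalize_codeset_sketch(const char *codeset, char *out,
                                     size_t out_size)
{
    size_t w = 0;

    if (out_size == 0)
        return;

    for (size_t i = 0; codeset[i] != '\0' && w + 1 < out_size; ++i) {
        unsigned char c = (unsigned char)codeset[i];
        if (isalpha(c))
            out[w++] = (char)tolower(c);   /* letters are lowercased */
        else if (isdigit(c))
            out[w++] = (char)c;            /* digits pass through unchanged */
        /* anything else, such as '-', is dropped */
    }
    out[w] = '\0';
}
```

Feeding it UTF-8 produces utf8, which matches the spelling seen in en_US.utf8.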