What is the most common encoding of each language?

I am developing a plain-text reader application. Sometimes the app can't automatically determine a file's encoding, so the user needs to select one from a list. If that list contains every supported encoding, it will be too long. I want to provide a simplified list that contains only the most common encodings for each language.

These are the mappings I know so far:

  • Traditional Chinese: Big5
  • Simplified Chinese: GB18030
  • Japanese: Shift-JIS, EUC-JP
  • Russian: KOI8-R

If you know the most common encoding for any other language, please tell me.


Solution 1:

On the web, UTF-8 is by far the most common encoding for all languages.

That being said, here are the Windows XP locales grouped by default character encoding ("Language for non-Unicode programs"):

  • Big5: zh_HK, zh_MO, zh_TW
  • GBK (≈GB2312): zh_CN, zh_SG
  • Windows-31J (≈Shift_JIS): ja_JP
  • windows-874 (≈TIS-620, ISO-8859-11): th_TH
  • windows-949 (≈EUC-KR): ko_KR
  • windows-1250: bs_BA, cs_CZ, hr_BA, hr_HR, hu_HU, pl_PL, ro_RO, sk_SK, sl_SI, sq_AL, sr_BA, sr_SP
  • windows-1251: az_AZ, be_BY, bg_BG, kk_KZ, ky_KG, mk_MK, mn_MN, ru_RU, sr_BA, sr_SP, tt_RU, uk_UA, uz_UZ
  • windows-1252 (≈ISO-8859-1): af_ZA, arn_CL, ca_ES, cy_GB, da_DK, de_AT, de_CH, de_DE, de_LI, de_LU, en_AU, en_BZ, en_CA, en_CB, en_GB, en_IE, en_JM, en_NZ, en_PH, en_TT, en_US, en_ZA, en_ZW, es_AR, es_BO, es_CL, es_CO, es_CR, es_DO, es_EC, es_ES, es_GT, es_HN, es_MX, es_NI, es_PA, es_PE, es_PR, es_PY, es_SV, es_UY, es_VE, eu_ES, fi_FI, fil_PH, fo_FO, fr_BE, fr_CA, fr_CH, fr_FR, fr_LU, fr_MC, fy_NL, ga_IE, gl_ES, id_ID, is_IS, it_CH, it_IT, iu_CA, iv_IV, lb_LU, moh_CA, ms_BN, ms_MY, nb_NO, nl_BE, nl_NL, nn_NO, ns_ZA, pt_BR, pt_PT, qu_BO, qu_EC, qu_PE, rm_CH, se_FI, se_NO, se_SE, sv_FI, sv_SE, sw_KE, tn_ZA, xh_ZA, zu_ZA
  • windows-1253: el_GR
  • windows-1254 (≈ISO-8859-9): az_AZ, tr_TR, uz_UZ
  • windows-1255: he_IL
  • windows-1256: ar_AE, ar_BH, ar_DZ, ar_EG, ar_IQ, ar_JO, ar_KW, ar_LB, ar_LY, ar_MA, ar_OM, ar_QA, ar_SA, ar_SY, ar_TN, ar_YE, fa_IR, ps_AF, ur_PK
  • windows-1257: et_EE, lt_LT, lv_LV
  • windows-1258: vi_VN

and the most common encodings overall on the Web as of October 30th 2020:

  1. UTF-8 95.7%
  2. ISO-8859-1 1.8%
  3. Windows-1251 1.0%
  4. Windows-1252 0.4%
  5. GB2312 0.3%
  6. Shift JIS 0.2%
  7. GBK 0.1%
  8. EUC-KR 0.1%
  9. ISO-8859-9 0.1%
  10. Windows-1254 0.1%
  11. EUC-JP 0.1%
  12. Big5 0.1%
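
As a rough illustration of how the locale table above could drive a shortened encoding list, here is a minimal Python sketch. It covers only a handful of the locales, uses Python's codec names (for example `cp932` for Windows-31J), and the names `LEGACY_BY_LOCALE` and `candidate_encodings` are made up for this example. Putting UTF-8 first and falling back to `cp1252` for unknown locales are assumptions, not something the table prescribes.

```python
# Subset of the Windows XP locale table above, keyed by Python codec name.
# Locales that the table lists under two code pages keep both entries.
LEGACY_BY_LOCALE = {
    "zh_TW": ["big5"],
    "zh_CN": ["gbk"],
    "ja_JP": ["cp932"],             # Windows-31J
    "th_TH": ["cp874"],
    "ko_KR": ["cp949"],
    "pl_PL": ["cp1250"],
    "ru_RU": ["cp1251"],
    "sr_SP": ["cp1250", "cp1251"],  # appears under both in the table
    "en_US": ["cp1252"],
    "el_GR": ["cp1253"],
    "tr_TR": ["cp1254"],
    "he_IL": ["cp1255"],
    "ar_EG": ["cp1256"],
    "lt_LT": ["cp1257"],
    "vi_VN": ["cp1258"],
}


def candidate_encodings(locale_tag):
    """Return UTF-8 first, then the legacy code page(s) for the locale.

    Unknown locales fall back to cp1252 (an assumption for this sketch).
    """
    legacy = LEGACY_BY_LOCALE.get(locale_tag, ["cp1252"])
    return ["utf-8"] + legacy
```

Calling `candidate_encodings("ru_RU")` would then give a two-entry list to show the user instead of the full list of supported encodings.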

Solution 2:

The HTML5 draft contains a table of default encodings for languages, reflecting what is regarded as common. However, note that it is supposed to be based on the user locale, i.e. the language of the browser or the operating system, not the language of the document. The reason is that the document's language is usually unknown, at least until you have actually read the document under some assumption about its encoding.

I think you could in practice copy the list of encodings from a popular web browser. If it works well there, it will probably work reasonably well in your application. Browsers do some clever things with the list and its order, but in practice I think a short list such as utf-8, utf-16, windows-1252, and maybe a few others, followed by an option to get the full list, would suffice. Note that although utf-16 is practically unused and useless for web pages, it is common in plain text files. It is important to name the encodings well, preferably with a common English (or other language) name together with the IANA “charset” name in parentheses, much like browsers do.
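
A minimal Python sketch of such a short list, with each entry labelled by a readable name plus the charset name in parentheses. The descriptive names, `SHORT_LIST`, `menu_label`, and `decode_with` are all hypothetical choices for this example; the only thing taken from the text above is the three encodings themselves.

```python
# Short list suggested above: (descriptive name, charset name).
# The descriptive names are illustrative, not taken from any browser.
SHORT_LIST = [
    ("Unicode", "utf-8"),
    ("Unicode", "utf-16"),
    ("Western European", "windows-1252"),
]


def menu_label(name, charset):
    """Format a menu entry as 'Readable name (charset)'."""
    return f"{name} ({charset})"


def decode_with(data, charset):
    """Decode bytes with the chosen charset; return None if that fails."""
    try:
        return data.decode(charset)
    except (UnicodeDecodeError, LookupError):
        # Invalid byte sequence for this charset, or unknown charset name.
        return None


labels = [menu_label(name, charset) for name, charset in SHORT_LIST]
```

When `decode_with` returns `None`, the app could keep the selection dialog open, or offer the full list, rather than showing garbled text.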

Solution 3:

I would recommend a menu structure like the one used by browsers. For instance, in Firefox: View -> Character Encoding -> More Encodings -> East Asian -> Chinese/Japanese/Korean (it's easier if you just look). In IE: View -> Encoding -> More.

It might seem too deep and clunky, but it is very familiar, and it does not drop useful encodings. (Why KOI8-R for Russian, for instance? And what happens if I use Windows-1251 and it is not in the list?)
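
A browser-style grouped menu could be modelled as a nested mapping, sketched below in Python. The "East Asian" group and its Chinese/Japanese/Korean entries follow the Firefox path mentioned above; the "Cyrillic" group, the specific codecs under each language, and the names `MENU` and `menu_paths` are assumptions made for illustration.

```python
# Nested menu: group -> language -> Python codec names.
# Only the East Asian grouping comes from the Firefox example above;
# everything else here is an illustrative assumption.
MENU = {
    "East Asian": {
        "Chinese": ["big5", "gb18030"],
        "Japanese": ["shift_jis", "euc_jp"],
        "Korean": ["euc_kr"],
    },
    "Cyrillic": {
        "Russian": ["koi8_r", "cp1251"],
    },
}


def menu_paths(menu):
    """Yield (group, language, codec) for every leaf of the nested menu."""
    for group, languages in menu.items():
        for language, codec_names in languages.items():
            for codec in codec_names:
                yield group, language, codec


paths = list(menu_paths(MENU))
```

Keeping both KOI8-R and Windows-1251 under Russian means neither is dropped, while the grouping keeps each submenu short.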