How to tell the language encoding of a filename on Linux?
Solution 1:
There is no fully reliable way, but you can make a good guess.
The Python library chardet does this kind of guessing; it is available here: https://pypi.python.org/pypi/chardet
For example:
See what the current LANG variable is set to:
$ echo $LANG
en_IE.UTF-8
Create a file whose name needs UTF-8 to be encoded:
$ touch mÉ.txt
List it, then change the locale and see what happens when we list it again:
$ ls m*
mÉ.txt
$ export LANG=C
$ ls m*
m??.txt
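OK, so now we have a filename encoded in UTF-8 and our current locale is C (the minimal POSIX locale, effectively ASCII). ls prints ?? because É takes two bytes in UTF-8 and neither is printable in that locale.
So start up Python (this session is Python 2; under Python 3, chardet.detect() needs bytes rather than str), import chardet and get it to read the filename. I'm using some shell globbing (i.e. expansion through the * wildcard character) to get my file. Change "ls m*" to whatever will match one of your example files.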
>>> import chardet
>>> import os
>>> chardet.detect(os.popen("ls m*").read())
{'confidence': 0.505, 'encoding': 'utf-8'}
As you can see, it's only a guess. How good a guess is shown by the "confidence" value in the returned dictionary, which here is only about 0.5.
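Because a short filename gives a statistical detector very little to work with, a strict UTF-8 decode can be a useful first check before guessing. Below is a minimal Python 3 sketch of that idea. The helper name is_valid_utf8 is hypothetical (it is not part of chardet), and the Latin-1 variant of the name is only there for comparison.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the raw bytes decode as strict UTF-8."""
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# The same visible name stored under two different encodings.
utf8_name = "mÉ.txt".encode("utf-8")      # É becomes two bytes: 0xC3 0x89
latin1_name = "mÉ.txt".encode("latin-1")  # É becomes one byte: 0xC9

print(len(utf8_name), is_valid_utf8(utf8_name))
print(len(latin1_name), is_valid_utf8(latin1_name))
```

The UTF-8 form is one byte longer, which matches the two question marks that ls printed. Note that a pure ASCII name also passes this check, since ASCII is a subset of UTF-8, so a True result does not prove that the name was written as UTF-8.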
Solution 2:
You may find this useful for testing the names in the current working directory (Python 2.7):
import chardet
import os
for n in os.listdir('.'):
    guess = chardet.detect(n)  # detect once and reuse the result
    print '%s => %s (%s)' % (n, guess['encoding'], guess['confidence'])
The result looks like:
Vorlagen => ascii (1.0)
examples.desktop => ascii (1.0)
Öffentlich => ISO-8859-2 (0.755682154041)
Videos => ascii (1.0)
.bash_history => ascii (1.0)
Arbeitsfläche => EUC-KR (0.99)
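A guess like EUC-KR for a German directory name is probably wrong, even at 0.99 confidence, so it can help to look at how the raw bytes decode under a few candidate encodings and judge by eye. Here is a minimal Python 3 sketch, assuming for illustration that the name was stored as UTF-8. The helper name decode_candidates and the list of candidates are hypothetical choices, not part of chardet.

```python
def decode_candidates(raw, encodings=("ascii", "utf-8", "latin-1", "euc-kr")):
    """Map each candidate encoding to the decoded text, or None on failure."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            results[enc] = None  # the bytes are not valid in this encoding
    return results


# Assumption: the directory name was written to disk as UTF-8.
raw_name = "Arbeitsfläche".encode("utf-8")

for enc, text in decode_candidates(raw_name).items():
    print(enc, "=>", text)
```

Several encodings may decode the same bytes without error, which is exactly why the detector can only guess. A human reader can usually tell which result is the intended name.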
To recurse through the directory tree below the current directory, paste this into a small Python script:
#!/usr/bin/python
import chardet
import os
for root, dirs, names in os.walk('.'):
    print root
    for n in names:
        guess = chardet.detect(n)  # detect once and reuse the result
        print '%s => %s (%s)' % (n, guess['encoding'], guess['confidence'])
Solution 3:
Landing here in 2021 using Python 3, I found the answers by @philip-reynolds and @klaus-kappel useful but no longer functional, as chardet.detect() expects a bytes-like object. I slightly edited the code to get the encoding of every file name in the current working directory, as follows:
import os
import chardet
for n in os.listdir('.'):
    # os.fsencode() turns the str name back into its raw bytes for chardet
    print(n, chardet.detect(os.fsencode(n)))
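The same idea can be extended to a whole directory tree under Python 3. This is a rough sketch, not a definitive implementation: the function name guess_tree is hypothetical, and the detector is passed in as a parameter so that chardet.detect (assumed to be installed) or any other callable taking bytes can be used.

```python
import os


def guess_tree(top, detect):
    """Yield (raw_path, guess) for every file below top.

    detect is any callable that accepts bytes, for example chardet.detect.
    """
    # Walking a bytes path makes os.walk() yield names as raw bytes,
    # so nothing is decoded before the detector sees it.
    for root, _dirs, names in os.walk(os.fsencode(top)):
        for name in names:
            yield os.path.join(root, name), detect(name)


# Usage, assuming chardet is installed:
#
#   import chardet
#   for path, guess in guess_tree(".", chardet.detect):
#       print(path, guess["encoding"], guess["confidence"])
```

Only the file name itself is handed to the detector, as in the scripts above; the full path is returned alongside it as bytes so the file can be located afterwards.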