How to auto detect text file encoding?

I have many plain text files that are encoded in various charsets.

I want to convert them all to UTF-8, but before running iconv I need to know each file's original encoding. Most browsers have an Auto Detect option for encodings, but I can't check the files one by one because there are too many.

Only once I know the original encoding can I convert a file with iconv -f DETECTED_CHARSET -t utf-8.

Is there any utility that detects the encoding of plain text files? It does not have to be 100% accurate; I don't mind if 100 files out of 1,000,000 are misconverted.

Try the chardet Python module, which is available on PyPI:

pip install chardet

Then run chardetect myfile.txt.

Chardet is based on the detection code used by Mozilla, so it should give reasonable results, provided that the input text is long enough for statistical analysis. Do read the project documentation.

As mentioned in the comments, chardet is quite slow, but some distributions also ship the original C++ version, as @Xavier found in https://superuser.com/a/609056. There is also a Java version somewhere.
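If you want to feed the chardetect result into iconv, you need the bare charset name rather than the whole report line. Here is a rough sketch of one way to pull it out. It assumes chardetect prints a line of the form "NAME: utf-8 with confidence 0.99" (or "NAME: no result"), and it reads the file on stdin so that NAME never contains spaces. The function name detect_with_chardet is made up for this example.

```shell
# Sketch: print only the charset name that chardetect reports for one file.
# Assumption: the report line looks like "NAME: utf-8 with confidence 0.99"
# or "NAME: no result". detect_with_chardet is an illustrative name.
detect_with_chardet() {
    # Reading from stdin keeps NAME free of spaces, so field 2 is the charset.
    enc=$(chardetect < "$1" | awk 'NR == 1 { print $2 }')
    case $enc in
        ''|no|None) return 1 ;;   # nothing usable was detected
    esac
    printf '%s\n' "$enc"
}
```

You could then call it as charset=$(detect_with_chardet myfile.txt) and pass "$charset" to iconv -f, skipping the file when the function returns a non-zero status.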

I would use this simple command:

encoding=$(file -bi myfile.txt)

Or if you want just the actual character set (like utf-8):

encoding=$(file -b --mime-encoding myfile.txt)
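To tie this back to the iconv step in the question, the detection and the conversion can be wrapped in one small function. This is only a sketch: the function name convert_to_utf8 and the ".utf8" output suffix are my own choices, and it assumes that file reports "binary" or "unknown-8bit" when it cannot name a charset.

```shell
# Sketch: convert one file to UTF-8 using the charset reported by file(1).
# convert_to_utf8 and the ".utf8" suffix are illustrative choices.
convert_to_utf8() {
    src=$1
    enc=$(file -b --mime-encoding "$src") || return 1
    case $enc in
        binary|unknown-8bit|'')
            # file(1) could not name a charset, so leave this file alone.
            echo "skipped: $src ($enc)" >&2
            return 1
            ;;
    esac
    # Write to a separate file so the original is never overwritten.
    if iconv -f "$enc" -t utf-8 "$src" > "$src.utf8"; then
        echo "converted: $src ($enc)"
    else
        rm -f "$src.utf8"   # drop the partial output of a failed conversion
        return 1
    fi
}
```

A loop such as for f in *.txt; do convert_to_utf8 "$f"; done would then process a whole directory, and the "skipped" messages on stderr show which files need a manual look.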

On Debian-based Linux distributions (Debian, Ubuntu), the uchardet package provides a command-line tool. Its package description reads:

 universal charset detection library - cli utility
 .
 uchardet is a C language binding of the original C++ implementation
 of the universal charset detection library by Mozilla.
 .
 uchardet is a encoding detector library, which takes a sequence of
 bytes in an unknown character encoding without any additional
 information, and attempts to determine the encoding of the text.
 .
 The original code of universalchardet is available at
 http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet
 .
 Techniques used by universalchardet are described at
 http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
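
Since uchardet is not installed everywhere, a script can prefer it when present and fall back to file otherwise. The sketch below assumes that uchardet myfile.txt prints a bare charset name such as UTF-8, and that it prints "unknown" (or "ascii/unknown" in some older builds) when detection fails; please check that against your installed version. The function name detect_charset is made up for this example.

```shell
# Sketch: report a charset with uchardet when installed, else with file(1).
# detect_charset is an illustrative name. The failure strings listed in the
# case statement are assumptions about what the two tools print.
detect_charset() {
    if command -v uchardet >/dev/null 2>&1; then
        enc=$(uchardet "$1")
    else
        enc=$(file -b --mime-encoding "$1")
    fi
    case $enc in
        ''|unknown|ascii/unknown|binary|unknown-8bit)
            return 1 ;;   # no usable charset name
    esac
    printf '%s\n' "$enc"
}
```

Treating "ascii/unknown" as a failure means plain ASCII files may be skipped, which is harmless here because ASCII text is already valid UTF-8.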

For Linux, there is enca, and for Solaris you can use auto_ef.
