Non-ISO extended-ASCII text
When I tried to find out the encoding of the file all.txt using
$ file all.txt
it showed this message:
all.txt: Non-ISO extended-ASCII text, with very long lines
What kind of encoding is "Non-ISO extended-ASCII text"?
I need to convert the file to another encoding, so I need to know which encoding it is in now.
Any help?
It is something that looks like neither UTF-8 nor ISO-8859-1. It might be anything else. It may not even be text at all. This type is a kind of fallback description for anything that does not contain zero bytes.
Even if it actually is a text file (the extension suggests it might be), there is unfortunately no automatic way to find out the encoding, because most encodings have the same range of valid codes. UTF-8 can be told apart with very high confidence, but beyond that it requires manual checking.
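As a rough illustration of the UTF-8 check mentioned above, one option is to ask iconv to convert the file from UTF-8 to UTF-8 and see whether it reports an error. This is only a sketch: the function name is_utf8 and the sample files are made up for the example, and a pure-ASCII file will also pass, since ASCII is a subset of UTF-8.

```shell
#!/bin/sh
# Hypothetical helper (not part of any tool): succeeds if the file
# decodes cleanly as UTF-8, fails otherwise.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Throwaway sample files, created only for this illustration.
tmpdir=$(mktemp -d)
printf 'B\303\244rbel\n' > "$tmpdir/utf8.txt"    # "Bärbel" encoded as UTF-8
printf 'B\204rbel\n'     > "$tmpdir/cp850.txt"   # "Bärbel" encoded as CP850

for f in "$tmpdir/utf8.txt" "$tmpdir/cp850.txt"; do
    if is_utf8 "$f"; then
        echo "$f: valid UTF-8"
    else
        echo "$f: not valid UTF-8"
    fi
done
```

A file that fails this check still has to be identified by hand, as described in the rest of this answer.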
First you have to find out what language the file is in, both to get an idea of what correct content looks like compared to garbled content and to get a list of possible encodings. There are zillions of encodings, but only a few were used for any particular language.
Then you need to try converting the file from each possible encoding, and for each conversion that succeeds technically (which unfortunately will be most of them), view the result and check whether it is correct or not.
A spell checker might help you with the review, since incorrect conversions will lead to more spell-checker errors.
For the conversion, you can use iconv(1), which is installed with the libc package on GNU/Linux, or recode. recode has more options and better error handling.
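The procedure above could be sketched roughly as follows, assuming the language is known to be German so that the candidate list can be kept short. The candidate list and the sample file are assumptions chosen for the example, not a recommendation. The actual review of each result still has to be done by eye or with a spell checker.

```shell
#!/bin/sh
# Throwaway sample: "Bärbel" with the umlaut stored as the single byte 0x84.
sample=$(mktemp)
printf 'B\204rbel\n' > "$sample"

# Assumed shortlist of encodings that were common for Western European text.
candidates="ISO-8859-1 ISO-8859-15 CP1252 CP850 CP437"

for enc in $candidates; do
    # A conversion that succeeds technically is not necessarily correct,
    # so print every result for manual review.
    if result=$(iconv -f "$enc" -t UTF-8 "$sample" 2> /dev/null); then
        printf '%-12s %s\n' "$enc" "$result"
    else
        printf '%-12s (conversion failed)\n' "$enc"
    fi
done
```

Several candidates will usually convert without error, so the printed lines have to be compared against what the text is expected to say.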
This won't fit into a comment, so here it goes: I too had a strange file on my hands:
$ file systeminfo.txt
systeminfo.txt: Non-ISO extended-ASCII text
I knew this was generated by a German Windows XP installation and contained some umlauts, but iconv was not able to convert it to something sensible:
$ iconv -t UTF-8 systeminfo.txt > systeminfo_utf8.txt
iconv: illegal input sequence at position 308
But since iconv knows so many encodings, I used a brute-force approach to find a working source encoding:
$ iconv --list | sed 's/\/\/$//' | sort > encodings.list
$ for a in `cat encodings.list`; do
    printf "$a "
    iconv -f $a -t UTF-8 systeminfo.txt > /dev/null 2>&1 \
      && echo "ok: $a" || echo "fail: $a"
  done | tee result.txt
Then I would go through result.txt and look for the encodings that didn't fail. In my case, -f CP850 -t UTF-8 worked just fine, and the umlauts are still there, only now encoded in UTF-8 :-)
I shortened the script by ckujau like this:
#!/bin/bash
# Try every encoding iconv knows on the first line of the file.
iconv --list | sed -e 's/\/\///g' | while read -r encoding
do
    transcoded=$(head -n1 strange-encoding.txt | iconv -sc -f "$encoding" -t UTF-8)
    echo "$encoding $transcoded"
done
So when I have a file with an unknown character encoding:
$ cat strange-encoding.txt
B�rbel
and I'm expecting this to be the German female given name "Bärbel", I can find all matching encodings with
$ ./check_encodings.sh | grep "Bärbel"
437 Bärbel
850 Bärbel
851 Bärbel
852 Bärbel
857 Bärbel
861 Bärbel
865 Bärbel
CP-HU Bärbel
CP437 Bärbel
CP770 Bärbel
CP773 Bärbel
CP774 Bärbel
CP775 Bärbel
CP850 Bärbel
CP851 Bärbel
CP852 Bärbel
CP857 Bärbel
CP861 Bärbel
CP865 Bärbel
CPIBM861 Bärbel
CSIBM851 Bärbel
CSIBM857 Bärbel
CSIBM865 Bärbel
CSPC8CODEPAGE437 Bärbel
CSPC775BALTIC Bärbel
CSPC850MULTILINGUAL Bärbel
CSPCP852 Bärbel
CWI-2 Bärbel
CWI Bärbel
IBM437 Bärbel
IBM775 Bärbel
IBM850 Bärbel
IBM851 Bärbel
IBM852 Bärbel
IBM857 Bärbel
IBM861 Bärbel
IBM865 Bärbel
OSF100201B5 Bärbel
Thanks to ckujau!
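Many of the names in the list above are aliases for the same code page, or code pages that happen to agree on the bytes present in the sample. One way to make such output easier to read is to group the encoding names by the text they produce. The following is only a sketch under that assumption: the sample file and the short encoding list are made up for the example, and in practice the list would come from iconv --list as in the script above.

```shell
#!/bin/sh
# Throwaway sample: "Bärbel" with the umlaut stored as the single byte 0x84.
sample=$(mktemp)
printf 'B\204rbel\n' > "$sample"

# Illustrative shortlist; names unknown to the local iconv are skipped.
encodings="CP437 IBM437 CP850 IBM850 ISO-8859-1 CP1252"

# Emit "converted text <TAB> encoding" pairs, then let awk collect all
# encoding names that yield the same converted text.
grouped=$(
    for enc in $encodings; do
        out=$(iconv -f "$enc" -t UTF-8 "$sample" 2> /dev/null) || continue
        printf '%s\t%s\n' "$out" "$enc"
    done |
    LC_ALL=C sort |
    awk -F '\t' '{ names[$1] = names[$1] " " $2 }
                 END { for (text in names) print text ":" names[text] }'
)
printf '%s\n' "$grouped"
```

Each output line then shows one candidate reading of the file followed by every encoding name that produces it, so only the distinct readings need to be reviewed.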