How can I identify non-ASCII characters from the shell?
Is there a simple way to print all non-ASCII characters, and the line numbers on which they occur, in a file using a command-line utility such as grep, awk, perl, etc.?
I want to change the encoding of a text file from UTF-8 to ASCII, but before doing so I wish to manually replace all instances of non-ASCII characters, to avoid unexpected substitutions made by the conversion routine.
Solution 1:
$ perl -ne 'print "$. $_" if m/[\x80-\xFF]/' utf8.txt
2 Pour être ou ne pas être
4 Byť či nebyť
5 是或不
or
$ grep -n -P '[\x80-\xFF]' utf8.txt
2:Pour être ou ne pas être
4:Byť či nebyť
5:是或不
where utf8.txt is
$ cat utf8.txt
To be or not to be.
Pour être ou ne pas être
Om of niet zijn
Byť či nebyť
是或不
Solution 2:
I want to change the encoding of a text file from UTF-8 to ASCII ...
... replace all instances of non-ASCII characters ...
Then tell your conversion tool to do so. With iconv, the -c flag silently discards any character that cannot be represented in the target encoding, while the //TRANSLIT suffix replaces it with a close ASCII approximation.
$ iconv -c -f UTF-8 -t ASCII <<< 'Look at 私.'
Look at .
$ iconv -c -f UTF-8 -t ASCII//translit <<< 'áēìöų'
aeiou
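After converting, a quick way to confirm that no non-ASCII characters remain is to run iconv again without -c (a sketch: iconv exits non-zero at the first character it cannot represent in the target encoding):

```shell
# Succeeds (exit 0) only if utf8.txt converts cleanly to ASCII, i.e. is pure ASCII.
if iconv -f UTF-8 -t ASCII utf8.txt > /dev/null 2>&1; then
    echo "pure ASCII"
else
    echo "still contains non-ASCII characters"
fi
```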