Program to check/look up UTF-8/Unicode characters in string on command line?
I've just realized I have a file on my system; it lists normally:
$ ls -la TΕSТER.txt
-rw-r--r-- 1 user user 8 2013-04-11 18:07 TΕSТER.txt
$ cat TΕSТER.txt
testing
... yet, it crashes a piece of software with a UTF-8/Unicode related error. I was really puzzled, since I couldn't tell why such a file is a problem; and finally I remembered to check the output of ls with hexdump:
$ ls TΕSТER.txt
TΕSТER.txt
$ ls TΕSТER.txt | hexdump -C
00000000 54 ce 95 53 d0 a2 45 52 2e 74 78 74 0a |T..S..ER.txt.|
0000000d
... Well, obviously there are some bytes in between/instead of some letters, so I guess it is a Unicode encoding problem. And I can try to echo the bytes back to see what is printed:
$ echo -e "\x54\xCE\x95\x53\xD0\xA2\x45\x52\x2E\x74\x78\x74"
TΕSТER.txt
... but I still cannot tell which - if any - Unicode characters these are.
So is there a command line tool which I can use to inspect a string on the terminal, and get Unicode information about its characters?
Try using uniname, part of the uniutils package on Debian and Ubuntu systems. Here's an example of uniname in action:
$ echo -e "\x54\xCE\x95\x53\xD0\xA2\x45\x52\x2E\x74\x78\x74" | uniname
No LINES variable in environment so unable to determine lines per page.
Using default of 24.
character  byte  UTF-32  encoded as  glyph  name
        0     0  000054  54          T      LATIN CAPITAL LETTER T
        1     1  000395  CE 95       Ε      GREEK CAPITAL LETTER EPSILON
        2     3  000053  53          S      LATIN CAPITAL LETTER S
        3     4  000422  D0 A2       Т      CYRILLIC CAPITAL LETTER TE
        4     6  000045  45          E      LATIN CAPITAL LETTER E
        5     7  000052  52          R      LATIN CAPITAL LETTER R
        6     8  00002E  2E          .      FULL STOP
        7     9  000074  74          t      LATIN SMALL LETTER T
        8    10  000078  78          x      LATIN SMALL LETTER X
        9    11  000074  74          t      LATIN SMALL LETTER T
       10    12  00000A  0A                 LINE FEED (LF)
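If uniutils isn't available, roughly the same table can be produced with Python's standard unicodedata module. This is only a minimal sketch, not a replacement for uniname: the describe() function and its column layout are my own naming, and control characters such as the trailing line feed come out as "<no name>" because unicodedata has no name for them.

```python
import unicodedata


def describe(text):
    """Return one (char_index, byte_offset, code_point, utf8_hex, name) row per character."""
    rows = []
    byte_offset = 0
    for index, ch in enumerate(text):
        encoded = ch.encode("utf-8")
        utf8_hex = " ".join(f"{b:02X}" for b in encoded)
        # unicodedata.name() raises ValueError for unnamed characters
        # (control characters, for example), so a fallback string is passed.
        name = unicodedata.name(ch, "<no name>")
        rows.append((index, byte_offset, f"{ord(ch):06X}", utf8_hex, name))
        byte_offset += len(encoded)
    return rows


# The bytes from the question, decoded as UTF-8.
sample = b"\x54\xCE\x95\x53\xD0\xA2\x45\x52\x2E\x74\x78\x74".decode("utf-8")
for row in describe(sample):
    print(*row, sep="\t")
```

The byte offsets are accumulated from the encoded length of each character, which is why the character index and the byte offset drift apart after the first two-byte character.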
Well, I looked a bit on the net, and found a one-liner ugrep in "Look up a unicode character by name | commandlinefu.com"; but that doesn't help me much here.
Then I saw codecs – String encoding and decoding - Python Module of the Week, which does have a lot of options - but not much related to Unicode character names.
So finally I coded a small tool, utfinfo.pl, which only accepts input on stdin:
- http://sdaaubckp.svn.sourceforge.net/viewvc/sdaaubckp/single-scripts/utfinfo.pl
... which gives me the following information:
$ ls TΕSТER.txt | perl utfinfo.pl
Got 10 uchars
Char: 'T' u: 84 [0x0054] b: 84 [0x54] n: LATIN CAPITAL LETTER T [Basic Latin]
Char: 'Ε' u: 917 [0x0395] b: 206,149 [0xCE,0x95] n: GREEK CAPITAL LETTER EPSILON [Greek and Coptic]
Char: 'S' u: 83 [0x0053] b: 83 [0x53] n: LATIN CAPITAL LETTER S [Basic Latin]
Char: 'Т' u: 1058 [0x0422] b: 208,162 [0xD0,0xA2] n: CYRILLIC CAPITAL LETTER TE [Cyrillic]
Char: 'E' u: 69 [0x0045] b: 69 [0x45] n: LATIN CAPITAL LETTER E [Basic Latin]
Char: 'R' u: 82 [0x0052] b: 82 [0x52] n: LATIN CAPITAL LETTER R [Basic Latin]
Char: '.' u: 46 [0x002E] b: 46 [0x2E] n: FULL STOP [Basic Latin]
Char: 't' u: 116 [0x0074] b: 116 [0x74] n: LATIN SMALL LETTER T [Basic Latin]
Char: 'x' u: 120 [0x0078] b: 120 [0x78] n: LATIN SMALL LETTER X [Basic Latin]
Char: 't' u: 116 [0x0074] b: 116 [0x74] n: LATIN SMALL LETTER T [Basic Latin]
... which then identifies which characters are not the "plain" ASCII ones.
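For anyone without Perl at hand, the "which characters are not plain ASCII" part can be sketched in a few lines of Python. This is not a port of utfinfo.pl; non_ascii_chars() is a hypothetical helper name, and it does not report the Unicode block, since the standard unicodedata module has no block lookup.

```python
import unicodedata


def non_ascii_chars(text):
    """Return (index, char, 'U+XXXX', name) for every character outside ASCII."""
    return [
        (i, ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<no name>"))
        for i, ch in enumerate(text)
        if ord(ch) > 0x7F  # ASCII ends at U+007F
    ]


# The look-alike filename from the question, written with escapes so the
# Greek and Cyrillic letters are visible in the source.
filename = "T\u0395S\u0422ER.txt"
for hit in non_ascii_chars(filename):
    print(hit)
```

An empty list means the string is pure ASCII, so the function can also serve as a quick yes/no check on a filename.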
Hope this helps someone,
Cheers!
Let's work with a character outside ASCII, for instance á.
The bytes of á:
echo -n 'á' | xxd
The Unicode code point of á:
echo -en 'á' | iconv -f utf-8 -t UNICODEBIG | xxd -g 2
So in the case of your filename we have:
echo -e "\x54\xCE\x95\x53\xD0\xA2\x45\x52\x2E\x74\x78\x74" | iconv -f utf-8 -t UNICODEBIG | xxd -g 2
This shows that the character that looks like a capital E is actually U+0395, which is drawn with the same glyph as the ASCII E (\x45).
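The same check can be sketched in Python without iconv or xxd, assuming the string is already decoded text. utf16_units() is a made-up helper name; it encodes to big-endian UTF-16 and returns one 16-bit unit per item, which equals the code point only for characters in the Basic Multilingual Plane (anything beyond it becomes a surrogate pair).

```python
def utf16_units(text):
    """Return the UTF-16BE code units of text as 4-digit hex strings."""
    data = text.encode("utf-16-be")  # big-endian, no byte order mark
    return [data[i:i + 2].hex() for i in range(0, len(data), 2)]


print(utf16_units("\u00e1"))               # the precomposed á
print(utf16_units("T\u0395S\u0422ER.txt"))  # the filename from the question
```

For the filename, the second unit is 0395 while the fifth is 0045, so the two E-shaped letters are clearly different characters.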