Program to check/look up UTF-8/Unicode characters in a string on the command line?

I've just realized I have a file on my system; it lists normally:

$ ls -la TΕSТER.txt 
-rw-r--r-- 1 user user 8 2013-04-11 18:07 TΕSТER.txt
$ cat TΕSТER.txt 
testing

... yet, it crashes a piece of software with a UTF-8/Unicode related error. I was really puzzled, since I couldn't tell why such a file is a problem; and finally I remembered to check the output of ls with hexdump:

$ ls TΕSТER.txt 
TΕSТER.txt
$ ls TΕSТER.txt | hexdump -C
00000000  54 ce 95 53 d0 a2 45 52  2e 74 78 74 0a           |T..S..ER.txt.|
0000000d

... Well, obviously there are some bytes in between/instead of some letters, so I guess it is a Unicode encoding problem. And I can try to echo the bytes back to see what is printed:

$ echo -e "\x54\xCE\x95\x53\xD0\xA2\x45\x52\x2E\x74\x78\x74"
TΕSТER.txt

... but I still cannot tell which - if any - Unicode characters these are.

So is there a command-line tool that I can use to inspect a string on the terminal and get Unicode information about its characters?


Try using uniname, part of the uniutils package on Debian and Ubuntu systems. Here's an example of uniname in action:

echo -e "\x54\xCE\x95\x53\xD0\xA2\x45\x52\x2E\x74\x78\x74" | uniname
No LINES variable in environment so unable to determine lines per page.
Using default of 24.
character  byte       UTF-32   encoded as     glyph   name
        0          0  000054   54             T      LATIN CAPITAL LETTER T
        1          1  000395   CE 95          Ε      GREEK CAPITAL LETTER EPSILON
        2          3  000053   53             S      LATIN CAPITAL LETTER S
        3          4  000422   D0 A2          Т      CYRILLIC CAPITAL LETTER TE
        4          6  000045   45             E      LATIN CAPITAL LETTER E
        5          7  000052   52             R      LATIN CAPITAL LETTER R
        6          8  00002E   2E             .      FULL STOP
        7          9  000074   74             t      LATIN SMALL LETTER T
        8         10  000078   78             x      LATIN SMALL LETTER X
        9         11  000074   74             t      LATIN SMALL LETTER T
       10         12  00000A   0A                     LINE FEED (LF)

Well, I looked a bit on the net, and found a one-liner ugrep in Look up a unicode character by name | commandlinefu.com; but that doesn't help me much here.

Then I saw codecs – String encoding and decoding - Python Module of the Week, which does have a lot of options - but not much related to Unicode character names.

So finally I coded a small tool utfinfo.pl, which only accepts input on stdin:

  • http://sdaaubckp.svn.sourceforge.net/viewvc/sdaaubckp/single-scripts/utfinfo.pl

... which gives me the following information:

$ ls TΕSТER.txt | perl utfinfo.pl 
Got 10 uchars
Char: 'T' u: 84 [0x0054] b: 84 [0x54] n: LATIN CAPITAL LETTER T [Basic Latin]
Char: 'Ε' u: 917 [0x0395] b: 206,149 [0xCE,0x95] n: GREEK CAPITAL LETTER EPSILON [Greek and Coptic]
Char: 'S' u: 83 [0x0053] b: 83 [0x53] n: LATIN CAPITAL LETTER S [Basic Latin]
Char: 'Т' u: 1058 [0x0422] b: 208,162 [0xD0,0xA2] n: CYRILLIC CAPITAL LETTER TE [Cyrillic]
Char: 'E' u: 69 [0x0045] b: 69 [0x45] n: LATIN CAPITAL LETTER E [Basic Latin]
Char: 'R' u: 82 [0x0052] b: 82 [0x52] n: LATIN CAPITAL LETTER R [Basic Latin]
Char: '.' u: 46 [0x002E] b: 46 [0x2E] n: FULL STOP [Basic Latin]
Char: 't' u: 116 [0x0074] b: 116 [0x74] n: LATIN SMALL LETTER T [Basic Latin]
Char: 'x' u: 120 [0x0078] b: 120 [0x78] n: LATIN SMALL LETTER X [Basic Latin]
Char: 't' u: 116 [0x0074] b: 116 [0x74] n: LATIN SMALL LETTER T [Basic Latin]

... which then identifies which characters are not the "plain" ASCII ones.
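For the narrower job of just spotting the non-ASCII characters, something like the following Python sketch might be enough. It is not a port of utfinfo.pl (it does not report block names, for instance), and the helper name non_ascii is made up for this example.

```python
import unicodedata

def non_ascii(text):
    """Return (index, character, code point, name) for each non-ASCII character."""
    return [
        (i, ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<no name>"))
        for i, ch in enumerate(text)
        if ord(ch) > 0x7F  # anything above 0x7F is outside ASCII
    ]

# The filename from the question, written as raw UTF-8 bytes to avoid ambiguity.
filename = b"T\xce\x95S\xd0\xa2ER.txt".decode("utf-8")
for hit in non_ascii(filename):
    print(*hit)
```

A filename made only of plain ASCII letters gives an empty list, which makes the lookalike characters stand out immediately.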

Hope this helps someone,
Cheers!


Let's work with a character outside ASCII, for instance á. To see the UTF-8 bytes of á:

echo -n 'á' | xxd

To see the Unicode code point of á:

echo -en 'á' | iconv -f utf-8 -t UNICODEBIG | xxd -g 2

So in the case of your filename we have:

echo -e "\x54\xCE\x95\x53\xD0\xA2\x45\x52\x2E\x74\x78\x74"  | iconv -f utf-8 -t UNICODEBIG | xxd -g 2

This shows that the code point of the first capital "E" in the name is U+0395 (GREEK CAPITAL LETTER EPSILON), which is drawn with the same glyph as the ASCII E (\x45) but is a different character.
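The same check can be sketched in Python, in case iconv or xxd is not at hand. I am assuming here that big-endian UTF-16 is what you want to look at; the helper name utf16_units is just for illustration.

```python
def utf16_units(text):
    """Return the big-endian UTF-16 code units of text as 4-digit hex strings."""
    data = text.encode("utf-16-be")  # no byte order mark is added for this codec
    return [data[i:i + 2].hex().upper() for i in range(0, len(data), 2)]

# The filename bytes from the question, decoded as UTF-8 first.
filename = b"\x54\xCE\x95\x53\xD0\xA2\x45\x52\x2E\x74\x78\x74".decode("utf-8")
print(utf16_units(filename))

# The lookalike and the real ASCII letter are different code points.
print(utf16_units("\u0395"), utf16_units("E"))
```

For characters in the Basic Multilingual Plane, as all of these are, each code unit is simply the code point, so 0395 and 0045 can be read off directly.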