How to convert \uXXXX Unicode escapes to UTF-8 using console tools on *nix

I use curl to fetch a URL. The response is JSON and contains Unicode-escaped national characters such as \u0144 (ń) and \u00f3 (ó).

How can I convert them to UTF-8, or any other encoding, to save to a file?


Solution 1:

Might be a bit ugly, but echo -e should do it:

echo -en "$(curl -s "$URL")"

-e interprets escapes, -n suppresses the newline echo would normally add.

Note: The \u escape works in the bash builtin echo, but not /usr/bin/echo.

As pointed out in the comments, this requires bash 4.2 or later, and the 4.2.x releases have a bug in handling values in the range 0x80-0xff.

Solution 2:

I don't know which distribution you are using, but uni2ascii should be available in its package repository.

$ sudo apt-get install uni2ascii

It only depends on libc6, so it's a lightweight solution (uni2ascii i386 4.18-2 is 55.0 kB on Ubuntu)!

Then to use it:

$ echo 'Character 1: \u0144, Character 2: \u00f3' | ascii2uni -a U -q
Character 1: ń, Character 2: ó

Solution 3:

I found native2ascii from the JDK to be the best way to do it:

native2ascii -encoding UTF-8 -reverse src.txt dest.txt

A detailed description is here: http://docs.oracle.com/javase/1.5.0/docs/tooldocs/windows/native2ascii.html

Update: No longer available since JDK9: https://bugs.openjdk.java.net/browse/JDK-8074431

Solution 4:

Assuming the \u is always followed by exactly 4 hex digits:

#!/usr/bin/perl

use strict;
use warnings;

binmode(STDOUT, ':utf8');

while (<>) {
    s/\\u([0-9a-fA-F]{4})/chr(hex($1))/eg;
    print;
}

The binmode puts standard output into UTF-8 mode. The s... command replaces each occurrence of \u followed by 4 hex digits with the corresponding character. The e suffix causes the replacement to be evaluated as an expression rather than treated as a string; the g says to replace all occurrences rather than just the first.

You can save the above to a file somewhere in your $PATH (don't forget the chmod +x). It filters standard input (or one or more files named on the command line) to standard output.

Again, this assumes that the representation is always \u followed by exactly 4 hex digits. There are more Unicode characters than can be represented that way, but I'm assuming that \u12345 would denote the Unicode character 0x1234 (ETHIOPIC SYLLABLE SEE) followed by the digit 5.

In C syntax, a universal-character-name is either \u followed by exactly 4 hex digits, or \U followed by exactly 8 hexadecimal digits. I don't know whether your JSON responses use the same scheme. You should probably find out how (or whether) it encodes Unicode characters outside the Basic Multilingual Plane (the first 2^16, i.e. 65,536, code points).
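
For what it's worth, the JSON spec (RFC 8259) writes characters outside the Basic Multilingual Plane as two consecutive \uXXXX escapes forming a UTF-16 surrogate pair, e.g. \ud83d\ude00 for U+1F600. Below is a minimal sketch of a filter that handles both cases. It is in Python rather than Perl, and the function name decode_u_escapes is just an illustrative choice. Like the Perl script above, it is a plain text substitution, so it assumes the input contains no escaped backslashes (a literal \\u0144 in JSON would be decoded wrongly).

```python
#!/usr/bin/env python3
# Filter: decode \uXXXX escapes read from stdin and write UTF-8 to stdout.
import re
import sys

# First alternative: a high surrogate (D800-DBFF) directly followed by a
# low surrogate (DC00-DFFF). Second alternative: any single \uXXXX escape.
ESCAPE = re.compile(
    r'\\u([dD][89abAB][0-9a-fA-F]{2})\\u([dD][c-fC-F][0-9a-fA-F]{2})'
    r'|\\u([0-9a-fA-F]{4})'
)


def _replace(match):
    high, low, single = match.groups()
    if single is None:
        # Combine the UTF-16 surrogate pair into one code point.
        code = 0x10000 + ((int(high, 16) - 0xD800) << 10) + (int(low, 16) - 0xDC00)
        return chr(code)
    code = int(single, 16)
    if 0xD800 <= code <= 0xDFFF:
        # Unpaired surrogate: not encodable as UTF-8, so keep the escape as-is.
        return match.group(0)
    return chr(code)


def decode_u_escapes(text):
    """Return text with every \\uXXXX escape replaced by its character."""
    return ESCAPE.sub(_replace, text)


if __name__ == "__main__":
    for line in sys.stdin:
        # Write bytes directly so the output is UTF-8 regardless of locale.
        sys.stdout.buffer.write(decode_u_escapes(line).encode("utf-8"))
```

Saved under some name of your choosing (say unescape_u.py, a made-up name), it could be used as: curl -s "$URL" | python3 unescape_u.py > out.txt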

Solution 5:

Now I have the best answer: use jq.

Windows:

type in.json | jq . > out.json

Linux:

cat in.json | jq . > out.json

It's probably faster than any answer using perl/python. The . is the identity filter; with nothing else given, jq just formats the JSON and converts \uXXXX to UTF-8. It can be used to do JSON queries too. Very nice tool!
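
If jq isn't installed, the same idea (parse the JSON, then write it back without ASCII-escaping) can be sketched with Python's standard json module. This is only a rough equivalent under that assumption, not a description of how jq works internally, and reencode_json is just an illustrative name.

```python
#!/usr/bin/env python3
# Read JSON from stdin, write it back pretty-printed with real UTF-8 characters.
import json
import sys


def reencode_json(text, indent=2):
    # json.loads decodes every \uXXXX escape (surrogate pairs included);
    # ensure_ascii=False makes json.dumps emit the characters themselves
    # instead of escaping them again.
    return json.dumps(json.loads(text), ensure_ascii=False, indent=indent)


if __name__ == "__main__":
    data = sys.stdin.read()
    if data.strip():  # nothing to do on empty input
        # Write bytes directly so the output is UTF-8 regardless of locale.
        sys.stdout.buffer.write((reencode_json(data) + "\n").encode("utf-8"))
```

Because this goes through a real JSON parser, escaped backslashes and quotes inside strings are handled correctly, which a plain text substitution cannot guarantee.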