How can I find all UTF-16 encoded text files in a directory tree with a Unix command?

I want to use a Unix shell command to find all UTF-16 encoded files (containing the UTF-16 Byte Order Mark (BOM)) in a directory tree. Is there a command that I can use?


Solution 1:

Although you asked about files that contain a BOM, file can also identify UTF-16 files that have no BOM. From man file:

If a file does not match any of the entries in the magic file, it is examined to see if it seems to be a text file. ASCII, ISO-8859-x, non-ISO 8-bit extended-ASCII character sets (such as those used on Macintosh and IBM PC systems), UTF-8-encoded Unicode, UTF-16-encoded Unicode, and EBCDIC character sets can be distinguished by the different ranges and sequences of bytes that constitute printable text in each set. If a file passes any of these tests, its character set is reported.

Hence, for example:

find . -type f -exec file --mime {} \; | grep "charset=utf-16"
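
If you only want the file names, without the rest of file's output, something along these lines might work. It is a rough sketch rather than a tested recipe: list_utf16 is a name I made up, and it assumes your version of file supports -b and --mime-encoding and reports UTF-16 as utf-16le or utf-16be.

```shell
#!/bin/sh
# list_utf16 [DIR]: print the regular files under DIR (default: .) whose
# encoding file(1) reports as UTF-16. The function name is hypothetical.
list_utf16() {
    find "${1:-.}" -type f -exec sh -c '
        for f do
            # -b drops the "name:" prefix, so only the encoding is printed.
            case $(file -b --mime-encoding -- "$f") in
                utf-16*) printf "%s\n" "$f" ;;
            esac
        done' sh {} +
}
```

Calling list_utf16 /some/dir prints one matching path per line. Like the pipeline above, it relies on file's heuristics, so treat the result as a hint rather than proof.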

Solution 2:

You can use grep:

 grep -rl $(echo -ne '^\0376\0377') *

(Tested with bash and GNU grep, might work with others.)

Explanation:

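The $(echo... part generates the BOM (hex FE FF, written as octal escape sequences), which is then passed to grep as its pattern, prefixed with '^' (= match at start of line). Note that FE FF is the big-endian BOM; little-endian UTF-16 files start with FF FE instead, so swap the two escapes ('^\0377\0376') to find those.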

-r is recursive search, -l makes grep print the names of files it found (instead of the matching line).

This might be a bit wasteful, as grep scans each file completely rather than just its start, and '^' matches at the start of any line, not only at the start of the file. If the tree holds mostly small text files, that will not matter. If you have lots of multi-megabyte files, you'll have to write a perl script :-).
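
If you would rather stay in the shell, reading just the first two bytes of each file avoids the full scan. The following is a minimal sketch, not a polished tool: has_utf16_bom and find_utf16_bom are names I made up, it assumes head -c and od are available, and it will also flag UTF-32LE files, since their BOM (FF FE 00 00) starts with the same two bytes.

```shell
#!/bin/sh
# has_utf16_bom FILE: succeed if FILE starts with a UTF-16 BOM
# (FF FE for little-endian, FE FF for big-endian). Hypothetical helper.
has_utf16_bom() {
    # Read only the first two bytes and render them as hex, e.g. "fffe".
    bom=$(head -c 2 -- "$1" | od -An -tx1 | tr -d ' \t\n')
    case $bom in
        fffe|feff) return 0 ;;
        *) return 1 ;;
    esac
}

# find_utf16_bom [DIR]: print every regular file under DIR (default: .)
# that starts with a UTF-16 BOM. File names containing newlines are not handled.
find_utf16_bom() {
    find "${1:-.}" -type f | while IFS= read -r f; do
        if has_utf16_bom "$f"; then
            printf '%s\n' "$f"
        fi
    done
}
```

Run find_utf16_bom . from the top of the tree to get one matching path per line.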

Alternatively, you could try file (combined with find+xargs). file will identify UTF-16 (as "UTF-16 Unicode character data"). I don't know how reliable it is, however (as it uses heuristics).

Solution 3:

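Here is the script that I use to find UTF-16 files, and subsequently convert them to UTF-8. It needs bash rather than plain sh because of the $'...' quoting, and it only checks for the little-endian BOM (FF FE).

#!/bin/bash

find . -type f |
while IFS= read -r file; do
    # Compare the first two bytes of the file with the UTF-16LE BOM.
    if [ "$(head -c 2 -- "$file")" = $'\xff\xfe' ]
    then
        echo "Problems with: $file"
        # If you want to convert to UTF-8 uncomment these lines.
        #iconv -f UTF-16 -t UTF-8 < "$file" > "$file.tmp" &&
        #mv -f -- "$file.tmp" "$file"
    fi
done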