Grepping binary files and UTF-16

Solution 1:

The easiest way is to just convert the text file to utf-8 and pipe that to grep:

iconv -f utf-16 -t utf-8 file.txt | grep query
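
If you need to search several files this way, the plain pipeline loses the file name, because grep only sees standard input. Here is a minimal sketch of a wrapper. The function name grep_utf16 is made up for illustration, and --label is a GNU grep option, so this assumes GNU grep:

```shell
# Hypothetical helper: grep_utf16 PATTERN FILE...
# Converts each UTF-16 file to UTF-8 and greps the result,
# labelling matches with the original file name (GNU grep's --label).
grep_utf16() {
    pattern=$1
    shift
    for f in "$@"; do
        iconv -f utf-16 -t utf-8 "$f" | grep -H --label="$f" -- "$pattern"
    done
}

# Example:
#   grep_utf16 query *.txt
```

If the files carry no BOM, naming the byte order explicitly (utf-16le or utf-16be) is probably safer than plain utf-16.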

I tried to do the opposite (convert my query to utf-16) but it seems as though grep doesn't like that. I think it might have to do with endianness, but I'm not sure.

It seems as though grep will convert a query that is utf-16 to utf-8/ascii. Here is what I tried:

grep `echo -n query | iconv -f utf-8 -t utf-16 | sed 's/..//'` test.txt

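If test.txt is a UTF-16 file this won't work, but it does work if test.txt is ASCII, so the query is clearly reaching grep as plain ASCII. My first guess was that grep converts it, but the more likely cause is the shell: the NUL bytes in the UTF-16 query are dropped by the backtick substitution, and a command-line argument cannot contain a NUL byte anyway.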
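
One way to see what grep actually receives is to dump the bytes before and after the command substitution. This is only a sketch, assuming bash or a POSIX sh. It uses utf-16le, which writes no BOM (so the sed step is not needed), and od to print the bytes:

```shell
# Bytes as iconv writes them: every ASCII letter is followed by a NUL,
# so this prints 71 00 75 00 65 00 72 00 79 00.
printf 'query' | iconv -f utf-8 -t utf-16le | od -An -tx1

# The same bytes after passing through a command substitution.
# bash and dash drop the NULs here (newer bash also prints a warning),
# so this prints 71 75 65 72 79, which is plain ASCII "query".
q=$(printf 'query' | iconv -f utf-8 -t utf-16le)
printf '%s' "$q" | od -An -tx1
```

Even in a shell that keeps NULs in variables, a command-line argument ends at the first NUL, so the UTF-16 pattern cannot reach grep this way.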

EDIT: Here's a really really crazy one that kind of works but doesn't give you very much useful info:

hexdump -v -e '/1 "%02x"' test.txt | grep -P `echo -n Test | iconv -f utf-8 -t utf-16 | sed 's/..//' | hexdump -e '/1 "%02x"'`

How does it work? It converts your file to hex, without any of the extra formatting that hexdump usually applies (the -v stops hexdump from collapsing repeated data into a single '*'), and pipes that into grep. The grep query is built by echoing your search term (without a newline) into iconv, which converts it to UTF-16. That is piped into sed to remove the BOM (the first two bytes of a UTF-16 file, used to determine endianness), and then into hexdump so that the query and the input are in the same form.

Unfortunately I think this will end up printing out the ENTIRE file if there is a single match, because the hex dump is one long line with no newlines in it. Also, this won't work if the UTF-16 in your binary file is stored in a different endianness than your machine.
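
To get something more useful than the whole file, grep's -o and -b options can report where the match starts instead. The sketch below is one way to do that, not a polished tool. The function name utf16_offsets is made up, it assumes GNU grep, it uses od and tr rather than hexdump to produce the hex, and utf-16le (little-endian, no BOM) is an assumption about the file:

```shell
# Hypothetical helper: utf16_offsets TERM FILE
# Prints the byte offset of each UTF-16LE occurrence of TERM in FILE.
utf16_offsets() {
    # Hex form of the search term, e.g. "Test" -> 5400650073007400
    needle=$(printf '%s' "$1" | iconv -f utf-8 -t utf-16le | od -An -v -tx1 | tr -d ' \n')
    # Hex form of the file as one long line; grep -ob prints "offset:match".
    od -An -v -tx1 "$2" | tr -d ' \n' | grep -ob -- "$needle" |
    while IFS=: read -r off _; do
        # Two hex digits per byte; an odd offset is a match that
        # straddles two bytes, so skip it.
        if [ $((off % 2)) -eq 0 ]; then
            echo $((off / 2))
        fi
    done
}

# Example:
#   utf16_offsets Test test.txt
```

Since grep -o reports non-overlapping matches, a misaligned hit could in principle hide a real one that starts right after it, so treat the output as a hint rather than a complete list.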

EDIT2: Got it!!!!

grep -P `echo -n "Test" | iconv -f utf-8 -t utf-16 | sed 's/..//' | hexdump -e '/1 "x%02x"' | sed 's/x/\\\\x/g'` test.txt

This searches for the hex version of the string Test (in utf-16) in the file test.txt
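
The same idea can be wrapped in a small function. This is only a sketch: utf16_pattern is a made-up name, utf-16le is used so that iconv writes no BOM (which removes the need for the first sed), and od and tr stand in for hexdump:

```shell
# Hypothetical helper: utf16_pattern TERM
# Prints TERM as a string of \xHH escapes for its UTF-16LE bytes.
utf16_pattern() {
    printf '%s' "$1" |
        iconv -f utf-8 -t utf-16le |
        od -An -v -tx1 |
        tr -d ' \n' |
        sed 's/../\\x&/g'
}

# Example (needs a grep built with -P support):
#   grep -aP "$(utf16_pattern Test)" test.txt
```

Because every byte ends up as a \xHH escape, characters such as '.' or '*' in the search term are matched literally rather than as regex operators.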

Solution 2:

I found the below solution worked best for me, from

Grep does not play well with Unicode, but it can be worked around. For example, to find,

Some Search Term

in a UTF-16 file, use a regular expression that skips the extra byte in each character (a NUL byte for plain ASCII text),

S.o.m.e. .S.e.a.r.c.h. .T.e.r.m 

Also tell grep to treat the file as text, using '-a'. The final command looks like this:

grep -a 'S.o.m.e. .S.e.a.r.c.h. .T.e.r.m' utf-16-file.txt
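
Typing the dots by hand gets tedious for longer terms. Here is a small sketch that builds the dotted pattern with sed. The function name dotted_pattern is made up, and it assumes the term contains no regex special characters, since it does no escaping:

```shell
# Hypothetical helper: dotted_pattern TERM
# Puts a "." after every character, then drops the trailing one.
dotted_pattern() {
    printf '%s\n' "$1" | sed 's/./&./g; s/\.$//'
}

# Example:
#   grep -a "$(dotted_pattern 'Some Search Term')" utf-16-file.txt
```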

Solution 3:

You can explicitly include the nulls (00s) in the search string, though you will get results with nulls, so you may want to redirect the output to a file so you can look at it with a reasonable editor, or pipe it through sed to replace the nulls. To search for "bar" in *.utf16.txt:

grep -Pa "b\x00a\x00r" *.utf16.txt | sed 's/\x00//g'

The "-P" tells grep to accept Perl regexp syntax, which allows \x00 to match a null byte, and the -a tells it to treat the file as text even though UTF-16 looks like binary to it.
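
The sed 's/\x00//g' step relies on sed understanding \x00, which as far as I know is a GNU sed extension. Where that is not available, tr can do the same job. A minimal sketch, where strip_nuls is a made-up name:

```shell
# Hypothetical helper: delete every NUL byte from standard input.
strip_nuls() {
    tr -d '\000'
}

# Example:
#   grep -Pa 'b\x00a\x00r' *.utf16.txt | strip_nuls
```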

Solution 4:

I use this one all the time after dumping the Windows registry, as its output is Unicode (UTF-16). This is running under Cygwin.

$ regedit /e
$ file Little-endian UTF-16 Unicode text, with CRLF line terminators

$ sed 's/\x00//g' | egrep "192\.168"
[HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Control\Print\Monitors\Standard TCP/IP Port\Ports\]
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Print\Monitors\Standard TCP/IP Port\Ports\]
[HKEY_USERS\S-1-5-21-2054485685-3446499333-1556621121-1001\Software\Microsoft\Terminal Server Client\Servers\]