Linux shell command to grep Unicode Character 'ZERO WIDTH SPACE' (U+200B)?

How can I grep for Unicode Character 'ZERO WIDTH SPACE' (U+200B) in Linux?

$ grep '%U200B' filename?

First, let's print one:

$ printf %b '\u200b' | uniname
character  byte       UTF-32   encoded as     glyph   name
        0          0  00200B   E2 80 8B               ZERO WIDTH SPACE

The uniname command is part of the Unicode utilities (the uniutils package).

Now we should be able to use the same format to search for it (using Bash):

$ printf %b '\u200b' | grep -q "$(printf %b '\u200b')"
$ echo $?
0

The trick here is that printf %b expands backslash escapes in its argument, so you can use \x to print individual bytes and \u* to print a multi-byte character by its Unicode code point.
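If your shell's printf doesn't understand \u, the same character can be spelled out byte by byte. This is a minimal sketch, assuming a POSIX printf (octal escapes in the format string are the portable form) and od; the variable names zwsp and zwsp_hex are made up for illustration.

```shell
# Build U+200B from its UTF-8 bytes E2 80 8B, written as octal escapes
# (E2 = 342, 80 = 200, 8B = 213 in octal).
zwsp=$(printf '\342\200\213')

# Dump the bytes in hex to confirm the encoding.
zwsp_hex=$(printf '%s' "$zwsp" | od -An -tx1 | tr -d ' \n')
echo "$zwsp_hex"   # e2808b
```

"$zwsp" can then be used anywhere "$(printf %b '\u200b')" is used in this answer.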

To find it in a file, simply do this:

grep "$(printf %b '\u200b')" filename
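Since the character is invisible, the matching lines don't tell you much on their own. Here is a rough sketch that adds line numbers and swaps the character for a visible marker; show_zwsp and the <U+200B> marker are hypothetical names I picked, and the pattern is built from octal bytes so it doesn't depend on \u support.

```shell
# show_zwsp is a hypothetical helper name, not a standard command.
show_zwsp() {
    zwsp=$(printf '\342\200\213')   # U+200B as UTF-8 bytes E2 80 8B
    # grep -n prefixes each matching line with its line number; sed then
    # replaces the invisible character with a visible marker.
    grep -n -- "$zwsp" "$1" | sed "s/$zwsp/<U+200B>/g"
}
```

Call it as show_zwsp filename. Each output line has the form number:text, with <U+200B> wherever the zero width space was.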

* The POSIX specification isn't actually clear on how %b works. The printf page says "The %b conversion specification [...] has been added here as a portable way to process <backslash>-escapes expanded in string operands as provided by the echo utility", and the echo page shows a single undocumented example of its use.

Test:

$ printf %b '\u200b' > test.txt
$ grep -q "$(printf %b '\u200b')" test.txt
$ echo $?
0

The following works fine. I created the file with BabelMap (Google it) and used its save option.

I created a file with line numbers 1-5 and added the zero width space on line 4:

> hexdump testout.txt -C                 
00000000  31 0a 32 0a 32 0a 33 0a  34 20 e2 80 8b 0a 35 0a  |1.2.2.3.4 ....5.|
00000010

Note the UTF-8 encoding of the character (e2 80 8b) in the file.
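In case it isn't obvious where those three bytes come from, they can be derived from the code point with shell arithmetic. This is only a sketch for code points in the three-byte UTF-8 range (U+0800 to U+FFFF), which covers U+200B; the variable names are mine.

```shell
cp=$((0x200B))                       # code point U+200B
b1=$(( 0xE0 | (cp >> 12) ))          # 1110xxxx: top 4 bits
b2=$(( 0x80 | ((cp >> 6) & 0x3F) ))  # 10xxxxxx: middle 6 bits
b3=$(( 0x80 | (cp & 0x3F) ))         # 10xxxxxx: low 6 bits
utf8_hex=$(printf '%02x %02x %02x' "$b1" "$b2" "$b3")
echo "$utf8_hex"   # e2 80 8b
```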

This simple grep finds the correct line:

> grep $'\u200b' testout.txt  
4 
> grep $'\u200b' testout.txt|hexdump -C
00000000  34 20 e2 80 8b 0a                                 |4 ....|
00000006

FWIW, my GREP_OPTIONS are set: "--color=auto -I -D skip -d skip", but I don't think any of them are relevant.

You can also use a Perl regex with GNU grep:

grep --perl-regexp '\x{200B}' filename

On macOS it is trickier, as the BSD grep that comes with it doesn't support multibyte characters. However, GNU grep can be installed via Homebrew, where it is made available as ggrep.
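If installing another grep isn't an option, one possible workaround is to search for the raw UTF-8 bytes as a fixed string in the C locale, so grep never has to decode multibyte characters. This is a sketch under that assumption; I haven't verified it against the grep shipped with macOS, and grep_zwsp_bytes is a hypothetical helper name.

```shell
# grep_zwsp_bytes is a hypothetical helper name, not a standard command.
grep_zwsp_bytes() {
    # -F treats the pattern as a fixed string rather than a regex;
    # LC_ALL=C makes grep compare bytes instead of decoded characters.
    # The pattern is U+200B as UTF-8 bytes E2 80 8B in octal.
    LC_ALL=C grep -F -- "$(printf '\342\200\213')" "$1"
}
```

Call it as grep_zwsp_bytes filename.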
