Linux shell command to grep Unicode Character 'ZERO WIDTH SPACE' (U+200B)?
How can I grep for the Unicode character 'ZERO WIDTH SPACE' (U+200B) in Linux?
$ grep '%U200B' filename?
First, let's print one:
$ printf %b '\u200b' | uniname
character  byte  UTF-32  encoded as  glyph  name
        0     0  00200B  E2 80 8B           ZERO WIDTH SPACE
The uniname command is part of the Unicode utilities.
Now we should be able to use the same format to search for it (using Bash):
$ printf %b '\u200b' | grep -q "$(printf %b '\u200b')"
$ echo $?
0
The trick here is that printf %b interprets backslash escapes in its arguments, so you can use \x to print a single byte and \u* to print a Unicode code point, which may be encoded as several bytes.
To find it in a file, simply do this:
grep "$(printf %b '\u200b')" filename
* The POSIX specification isn't actually clear on how %b works. The printf page says "The %b conversion specification [...] has been added here as a portable way to process <backslash>-escapes expanded in string operands as provided by the echo utility", and the echo page shows a single undocumented example of its use.
Test:
$ printf %b '\u200b' > test.txt
$ grep -q "$(printf %b '\u200b')" test.txt
$ echo $?
0
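Since \u is not covered by POSIX, a more portable variant is to build the pattern from the three UTF-8 bytes shown by uniname (E2 80 8B), written as octal escapes in the printf format string. The following is only a sketch under that assumption; the sample file and the variable names (zwsp, sample, matches, count) are made up for illustration.

```shell
# Build the pattern from the UTF-8 bytes E2 80 8B using octal escapes,
# which printf accepts in its format string without relying on \u.
zwsp=$(printf '\342\200\213')

# Hypothetical sample file: only the second line contains U+200B.
sample=$(mktemp)
printf 'plain line\nhidden%sspace\nanother plain line\n' "$zwsp" > "$sample"

# -n prefixes each matching line with its line number; cut keeps just the number.
matches=$(grep -n "$zwsp" "$sample" | cut -d: -f1)
# -c prints how many lines matched.
count=$(grep -c "$zwsp" "$sample")
echo "lines containing U+200B: $matches (count: $count)"

rm -f "$sample"
```

Using grep -n this way is handy when the character has to be tracked down and removed by hand, because it is invisible in most editors.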
The following works fine. I created the file with BabelMap (google it) and used its save option.
I created a file with line numbers 1-5 and added the zero-width space on line 4:
> hexdump testout.txt -C
00000000 31 0a 32 0a 32 0a 33 0a 34 20 e2 80 8b 0a 35 0a |1.2.2.3.4 ....5.|
00000010
Note the UTF-8 encoding of the character, e2 80 8b, in the file.
This simple grep finds the correct line:
> grep $'\u200b' testout.txt
4
> grep $'\u200b' testout.txt|hexdump -C
00000000 34 20 e2 80 8b 0a |4 ....|
00000006
FWIW, my GREP_OPTIONS is set to "--color=auto -I -D skip -d skip", but I don't think any of those options are relevant.
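For anyone who wants to reproduce this check without BabelMap or hexdump, here is a rough sketch that recreates the same 16 bytes with printf and inspects the match with od. The octal escapes \342\200\213 are the bytes e2 80 8b from the dump above; the variable names (testfile, zwsp, hex) are only illustrative.

```shell
# Recreate the test file byte for byte from the hexdump above.
testfile=$(mktemp)
printf '1\n2\n2\n3\n4 \342\200\213\n5\n' > "$testfile"

# The search pattern: U+200B as its three UTF-8 bytes.
zwsp=$(printf '\342\200\213')

# Dump the matching line as hex bytes; tr removes od's spacing and line breaks
# so the result is one continuous string.
hex=$(grep "$zwsp" "$testfile" | od -An -tx1 | tr -d ' \n')
echo "$hex"

rm -f "$testfile"
```

The printed string should contain the same six bytes as the second hexdump (34 20 e2 80 8b 0a), just without the separators.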
You can also use a Perl-compatible regex with GNU grep:
grep --perl-regexp '\x{200B}' filename
On macOS it is trickier, as the BSD grep that comes with it doesn't support multibyte characters. However, GNU grep can be installed via Homebrew, where it is made available as ggrep.
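A script that has to run on both Linux and macOS could pick whichever grep is available. This is a minimal sketch, assuming ggrep is on PATH when Homebrew's GNU grep is installed; the GREP variable and the sample file are hypothetical. It uses a fixed-string search (-F) with the raw UTF-8 bytes, so it does not depend on --perl-regexp support.

```shell
# Prefer GNU grep under the name ggrep when it is installed;
# otherwise fall back to whatever grep is on PATH.
if command -v ggrep >/dev/null 2>&1; then
    GREP=ggrep
else
    GREP=grep
fi

# U+200B as its UTF-8 bytes E2 80 8B, written as octal escapes.
zwsp=$(printf '\342\200\213')

# Hypothetical sample file: only the second line contains U+200B.
sample=$(mktemp)
printf 'no hidden characters\nzero%swidth\n' "$zwsp" > "$sample"

# -F treats the pattern as a fixed string; -c counts matching lines.
found=$("$GREP" -c -F "$zwsp" "$sample")
echo "$GREP found $found matching line(s)"

rm -f "$sample"
```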