Weird grep behavior with CJK characters? (bash)
grep fails to match certain strings containing CJK characters. For example:
- Create a text file with content below:
==ShellType.サモナ\u30FC==
- Use grep.
>> grep "ShellType.サモナ\u30FC" test.txt
(empty output)
>> grep "ShellType.サモナ.*\u30FC" test.txt
==ShellType.サモナ\u30FC==
Is this a grep bug, or do CJK characters need special handling?
How can I properly search for CJK strings with grep, or with other reliable tools?
System: Ubuntu 20.04
GNU bash, version 5.0.17(1)-release (x86_64-pc-linux-gnu)
grep (GNU grep) 3.4
It has nothing to do with CJK. You can use -o to (more or less) see what \u actually means in grep:
[tom@ideapad ~]$ cat /tmp/meh
==ShellType.サモナ\u30FC==
[tom@ideapad ~]$ grep -o '\u' /tmp/meh
u
[tom@ideapad ~]$ grep -o '.\u' /tmp/meh
\u
[tom@ideapad ~]$ grep -o '.*\u' /tmp/meh
==ShellType.サモナ\u
[tom@ideapad ~]$ grep -o '.*.*\u' /tmp/meh
==ShellType.サモナ\u
[tom@ideapad ~]$ grep -o '==ShellType.サモナ.*\u' /tmp/meh
==ShellType.サモナ\u
[tom@ideapad ~]$ grep -o '==ShellType.サモナ.\u' /tmp/meh
==ShellType.サモナ\u
Note that I've been using single quotes, since with \ double quotes could make things even more complicated. The proper ways to do the grep you (seem to) desire are:
[tom@ideapad ~]$ grep -o '==ShellType\.サモナ\\u' /tmp/meh
==ShellType.サモナ\u
[tom@ideapad ~]$ grep -o "==ShellType\\.サモナ\\\\u" /tmp/meh
==ShellType.サモナ\u
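To see why the single-quoted and double-quoted commands above are equivalent, it can help to print the pattern exactly as grep would receive it. This is a minimal sketch; the variable names single and double are only for illustration.

```shell
# The same pattern, written once with single quotes and once with double quotes.
single='==ShellType\.サモナ\\u'
double="==ShellType\\.サモナ\\\\u"

# printf '%s' does not interpret backslashes in its arguments,
# so this shows the literal string the shell hands to grep.
printf 'single: %s\n' "$single"
printf 'double: %s\n' "$double"
```

Both lines show the same pattern: inside double quotes the shell turns each \\ into a single \ before grep sees it, which is why the double-quoted form needs twice as many backslashes.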
As far as I know, grep does not consider \u30FC (however further escaped) to be a Unicode character the way printf in a shell does. To actually grep for a character by its code point, you can have the shell expand it first with ANSI-C quoting (it might not work in every POSIX shell, though):
[tom@ideapad ~]$ printf '\u30FC' > /tmp/heh
[tom@ideapad ~]$ grep $'\u30FC' /tmp/heh
ー
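For shells without ANSI-C quoting, one possible fallback is to build the character from its UTF-8 bytes with octal escapes, which POSIX printf supports. This sketch assumes the file is UTF-8 encoded; the temporary file and the variable names are hypothetical.

```shell
# Create a small UTF-8 sample file in a temporary location.
codepoint_file=$(mktemp)
printf '%s\n' 'サモナー' > "$codepoint_file"

# U+30FC encodes to the UTF-8 bytes E3 83 BC, i.e. octal 343 203 274.
long_vowel=$(printf '\343\203\274')

# Pass the expanded character to grep as an ordinary pattern.
codepoint_match=$(grep -- "$long_vowel" "$codepoint_file")
printf '%s\n' "$codepoint_match"

rm -f "$codepoint_file"
```

The byte values have to be worked out by hand for each character, so this is only practical for the occasional code point.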
P.S. It might be worth mentioning that, while ANSI-C quoting makes use of single quotes in its syntax, it does NOT behave like single quotes apart from the code point expansion. Backslash escapes are still processed, so the first command below prints nothing and the backslashes have to be doubled:
[tom@ideapad ~]$ grep -o $'==ShellType\.サモナ\\u' /tmp/meh
[tom@ideapad ~]$ grep -o $'==ShellType\\.サモナ\\\\u' /tmp/meh
==ShellType.サモナ\u
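If the goal is only to find the literal text from the question, the escaping above can be avoided altogether with grep's -F option, which treats the pattern as a fixed string rather than a regular expression. A minimal sketch, with a hypothetical temporary file standing in for test.txt:

```shell
# Recreate the sample line from the question in a temporary file.
sample=$(mktemp)
printf '%s\n' '==ShellType.サモナ\u30FC==' > "$sample"

# With -F, the '.' and the '\' in the pattern are matched literally,
# so the string can be pasted as-is inside single quotes.
fixed_match=$(grep -F 'ShellType.サモナ\u30FC' "$sample")
printf '%s\n' "$fixed_match"

rm -f "$sample"
```

Single quotes still matter here: they keep the shell from touching the backslash before grep receives it.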