How can I delete U+200B (zero-width space) using sed?
I have a very large file that has zero-width spaces scattered throughout. It takes too long to open and edit using vi, so I'd like to delete all instances of the character using sed. The problem is, I can't figure out how to match the character! I've tried \u200B and \x{200b}. Any ideas?
I'm running CentOS 5 if that helps at all.
Solution 1:
This seems to work for me:
sed 's/\xe2\x80\x8b//g' inputfile
Demonstration:
$ /usr/bin/printf 'X\u200bY\u200bZ' | hexdump -C
00000000 58 e2 80 8b 59 e2 80 8b 5a |X...Y...Z|
$ /usr/bin/printf 'X\u200bY\u200bZ' | sed 's/\xe2\x80\x8b//g' | hexdump -C
00000000 58 59 5a |XYZ|
Edit:
Based partially on Gilles' answer:
tr -d $(/usr/bin/printf "\u200b") < inputfile
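If the `\x` escapes are not available in your sed, a possible variant is to build the three bytes with POSIX octal escapes and pass them to sed literally. This is only a sketch: the `zwsp` and `result` variable names are made up for illustration, and `LC_ALL=C` is an assumption that sed should treat the input as raw bytes rather than as UTF-8 characters.

```shell
# The UTF-8 encoding of U+200B is the bytes e2 80 8b (octal 342 200 213).
zwsp=$(printf '\342\200\213')

# Same test string as in the demonstration above: X, U+200B, Y, U+200B, Z.
# LC_ALL=C asks sed to match byte by byte instead of decoding UTF-8.
result=$(printf 'X\342\200\213Y\342\200\213Z' | LC_ALL=C sed "s/$zwsp//g")

printf '%s\n' "$result"
```

Replacing the `printf` pipeline with `< inputfile` applies the same substitution to a real file.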
Solution 2:
GNU sed's behavior with UTF-8 doesn't seem to be very well-defined. Experimentally, you can make it replace the bytes of the UTF-8 representation:
<old sed 's/\xe2\x80\x8b//g' >new
Alternatively, you can type the character into your shell and use any of the standard commands in a UTF-8 locale (the quotes and the pattern below are meant to contain a literal U+200B, which is invisible):
<old tr -d '' >new
<old sed 's///g' >new
In zsh, you can also enter the character through an escape sequence:
<old tr -d $'\u200B' >new
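Since the file is large, it may be worth checking the result without opening it in an editor. The following is a rough sketch, not a definitive procedure: the temporary files stand in for the real `old` and `new`, and the variable names `zwsp`, `removed`, and `remaining` are hypothetical. It relies on the fact that every deleted U+200B shrinks the file by exactly three bytes.

```shell
zwsp=$(printf '\342\200\213')   # bytes e2 80 8b, written as octal escapes

# Hypothetical stand-ins for the real input and output files.
old=$(mktemp)
new=$(mktemp)
printf 'a\342\200\213b\nc\342\200\213\342\200\213d\n' > "$old"

# Stream the file through sed; the original is left untouched.
LC_ALL=C sed "s/$zwsp//g" < "$old" > "$new"

# Each removed character is 3 bytes, so the size difference gives the count.
removed=$(( ($(wc -c < "$old") - $(wc -c < "$new")) / 3 ))

# grep -c prints the number of lines that still contain the character;
# it exits non-zero when that number is 0, hence the "|| true".
remaining=$(LC_ALL=C grep -c "$zwsp" "$new" || true)

echo "removed: $removed, lines still affected: $remaining"
rm -f "$old" "$new"
```

The size check only holds if nothing else in the file changed, which is the case here because the sed script performs a single substitution.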