How can I delete U+200B (Zero-width space) using sed

I have a very large file that has zero-width spaces scattered throughout. It takes too long to open and edit using vi, so I'd like to delete all instances of the character using sed. The problem is, I can't figure out how to match the character! I've tried \u200B and \x{200b}, but neither matches. Any ideas?

I'm running CentOS 5 if that helps at all.

Solution 1:

This seems to work for me:

sed 's/\xe2\x80\x8b//g' inputfile

Demonstration:

$ /usr/bin/printf 'X\u200bY\u200bZ' | hexdump -C
00000000  58 e2 80 8b 59 e2 80 8b  5a                       |X...Y...Z|
$ /usr/bin/printf 'X\u200bY\u200bZ' | sed 's/\xe2\x80\x8b//g' | hexdump -C
00000000  58 59 5a                                          |XYZ|
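
For a file on disk rather than a pipe, the same substitution can probably be applied in place. This is only a minimal sketch, assuming GNU sed (for -i and the \x escapes). The temporary file named sample is a made-up stand-in for inputfile, and the .bak suffix keeps a copy of the original:

```shell
# Hypothetical stand-in for inputfile, so the sketch runs on its own.
sample=$(mktemp)
# Octal escapes \342\200\213 are the bytes e2 80 8b (U+200B in UTF-8).
printf 'X\342\200\213Y\342\200\213Z\n' > "$sample"

# Edit in place, keeping the original as "$sample.bak" (GNU sed extension).
sed -i.bak 's/\xe2\x80\x8b//g' "$sample"

# Compare sizes: each removed character accounts for 3 bytes.
before=$(wc -c < "$sample.bak")
after=$(wc -c < "$sample")
echo "removed $(( (before - after) / 3 )) zero-width spaces"
```

Comparing byte counts is a cheap sanity check on a large file, since it avoids opening the file in an editor.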

Edit:

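Based partially on Gilles' answer (this relies on tr handling multibyte characters; GNU tr works byte by byte, so it may also strip the bytes e2, 80, and 8b from other characters):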

tr -d $(/usr/bin/printf "\u200b") < inputfile

Solution 2:

GNU sed's behavior with UTF-8 doesn't seem to be very well-defined. Experimentally, you can make it replace the bytes of the UTF-8 representation:

<old sed 's/\xe2\x80\x8b//g' >new
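
Since the multibyte behavior is uncertain, one way to make it more predictable is to run sed in the C locale, where every byte counts as its own character. A sketch, again assuming GNU sed for the \x escapes; the files old and new are replaced by a pipe here so that it runs standalone:

```shell
# In the C locale sed matches raw bytes, so the three-byte pattern
# is not subject to UTF-8 decoding rules.
cleaned=$(printf 'X\342\200\213Y\342\200\213Z' | LC_ALL=C sed 's/\xe2\x80\x8b//g')
printf '%s\n' "$cleaned"   # XYZ
```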

Alternatively, you can type the character into your shell and use any of the standard commands in a UTF-8 locale. The zero-width space goes between the quotes and in the sed pattern, where it is invisible:

<old tr -d '' >new
<old sed 's///g' >new

In zsh, you can also enter the character through an escape sequence:

<old tr -d $'\u200B' >new
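
In shells whose quoting lacks \u escapes (older bash versions, for example), the character can instead be built from its UTF-8 bytes with printf's octal escapes and handed to sed as a literal. A minimal sketch; the variable name zwsp is just illustrative:

```shell
# Build U+200B from its UTF-8 bytes (e2 80 8b) using POSIX octal escapes.
zwsp=$(printf '\342\200\213')

# The shell expands $zwsp before sed runs, so sed sees the literal bytes
# and needs no escape-sequence support of its own.
result=$(printf 'X%sY%sZ' "$zwsp" "$zwsp" | sed "s/$zwsp//g")
printf '%s\n' "$result"   # XYZ
```

Because the pattern reaches sed as plain bytes, this form should not depend on GNU-specific \x handling.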
