Can UTF-8 contain a zero byte?

Solution 1:

Yes. A zero byte in UTF-8 is code point 0, NUL. No other Unicode code point is encoded in UTF-8 with a zero byte anywhere in its sequence.

The possible code points and their UTF-8 encodings are:

Range              Encoding  Binary value
-----------------  --------  --------------------------
U+000000-U+00007f  0xxxxxxx  0xxxxxxx

U+000080-U+0007ff  110yyyxx  00000yyy xxxxxxxx
                   10xxxxxx

U+000800-U+00ffff  1110yyyy  yyyyyyyy xxxxxxxx
                   10yyyyxx
                   10xxxxxx

U+010000-U+10ffff  11110zzz  000zzzzz yyyyyyyy xxxxxxxx
                   10zzyyyy
                   10yyyyxx
                   10xxxxxx

You can see that all the non-zero ASCII characters are represented as themselves, while all multibyte sequences have a high bit of 1 in every one of their bytes.
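As a rough check of the table, here is a minimal sketch that encodes a code point by hand using the bit layouts above, then scans every code point for a zero byte. The function names `utf8_encode` and `code_points_with_zero_byte` are made up for this illustration, and the surrogate range U+D800-U+DFFF is skipped on the assumption that only Unicode scalar values are being encoded.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point by hand, following the bit layout in the table."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp <= 0x7F:
        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 110yyyxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110yyyy 10yyyyxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110zzz 10zzyyyy 10yyyyxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


def code_points_with_zero_byte() -> list:
    """Return every code point whose UTF-8 encoding contains a 0x00 byte."""
    result = []
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not encodable in UTF-8
        if 0 in utf8_encode(cp):
            result.append(cp)
    return result


# The hand-written encoder should agree with Python's built-in codec.
for sample in (0x00, 0x41, 0x7F, 0x80, 0x7FF, 0x800, 0x20AC, 0xFFFF, 0x10000, 0x10FFFF):
    assert utf8_encode(sample) == chr(sample).encode("utf-8")

print(code_points_with_zero_byte())
```

The scan visits roughly 1.1 million code points, so it may take a second or two to run.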

You may need to be careful that your ASCII plaintext protocol doesn't mishandle bytes outside the ASCII range, since every byte in the encoding of a non-ASCII code point falls outside it.

Solution 2:

ASCII text is restricted to byte values between 0 and 127. UTF-8 text has no such restriction: its bytes may have the high bit set. So it's not safe to send UTF-8 text over a channel that doesn't guarantee safe passage for that high bit.
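To illustrate the risk, here is a small sketch that simulates a channel which clears the high bit of every byte. The function `through_7bit_channel` is a hypothetical stand-in, not a model of any particular real protocol.

```python
def through_7bit_channel(data: bytes) -> bytes:
    """Hypothetical channel that clears the high bit of every byte."""
    return bytes(b & 0x7F for b in data)


original = "café"
sent = original.encode("utf-8")        # the final character becomes 0xC3 0xA9
received = through_7bit_channel(sent)  # 0xC3 -> 0x43 ('C'), 0xA9 -> 0x29 (')')

print(sent)
print(received)
```

Pure ASCII input passes through unchanged, but the two bytes that encoded the accented character arrive as unrelated ASCII characters, and the receiver has no way to recover the original.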

If you're forced to deal with an ASCII-only channel, Base64 is a reasonable (though not particularly space-efficient) choice. Are you sure you're limited to 7-bit data, though? That's somewhat unusual these days.
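A minimal sketch of that approach using Python's standard `base64` module follows. The helper names `to_ascii_safe` and `from_ascii_safe` are invented for this example.

```python
import base64


def to_ascii_safe(text: str) -> str:
    """UTF-8 encode the text, then Base64 it so only ASCII goes on the wire."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def from_ascii_safe(payload: str) -> str:
    """Reverse of to_ascii_safe: Base64 decode, then UTF-8 decode."""
    return base64.b64decode(payload.encode("ascii")).decode("utf-8")


message = "naïve €5"
wire = to_ascii_safe(message)

assert all(ord(ch) < 128 for ch in wire)  # every character is 7-bit safe
assert from_ascii_safe(wire) == message   # round trip is lossless
print(len(message.encode("utf-8")), "bytes ->", len(wire), "characters")
```

Standard Base64 emits four output characters for every three input bytes (with padding), which is where the space overhead comes from.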

Solution 3:

A UTF-8 encoded string can have most values from 0x00 to 0xff in a given byte position of its backing memory, although a few specific combinations are not allowed (see http://en.wikipedia.org/wiki/UTF-8) and the octet values C0, C1, and F5 to FF never appear.
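One way to check which octet values never appear is to encode every code point with Python's built-in codec and collect the byte values seen. This is only a brute-force sketch; the function name `bytes_used_by_utf8` is made up here, and surrogates are skipped because they cannot be encoded.

```python
def bytes_used_by_utf8() -> set:
    """Collect every byte value that occurs in some valid UTF-8 encoding."""
    seen = set()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not encodable in UTF-8
        seen.update(chr(cp).encode("utf-8"))
    return seen


unused = sorted(set(range(256)) - bytes_used_by_utf8())
print([f"{b:02X}" for b in unused])
```

The printed list should match the octets named above: C0, C1, and F5 through FF.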

If you are transporting across a channel such as an ASCII stream that does not support binary data, you will have to encode the data appropriately. Base64 is broadly supported and will certainly solve that problem, though it is not entirely efficient, since it uses a 64-character alphabet to encode data whereas ASCII allows for a 128-character space.

There is a SourceForge project that provides base91 encoding, which is more space-efficient while still avoiding non-printable characters: http://base91.sourceforge.net/