How many bytes does one Unicode character take?

I am a bit confused about encodings. As far as I know old ASCII characters took one byte per character. How many bytes does a Unicode character require?

I assume that one Unicode character can contain every possible character from any language - am I correct? So how many bytes does it need per character?

And what do UTF-7, UTF-6, UTF-16 etc. mean? Are they different versions of Unicode?

I read the Wikipedia article about Unicode but it is quite difficult for me. I am looking forward to seeing a simple answer.

Solution 1:

Strangely enough, nobody has pointed out how to calculate how many bytes one Unicode character takes. Here is the rule for UTF-8 encoded strings:

Binary    Hex          Comments
0xxxxxxx  0x00..0x7F   Only byte of a 1-byte character encoding
10xxxxxx  0x80..0xBF   Continuation byte: one of 1-3 bytes following the first
110xxxxx  0xC0..0xDF   First byte of a 2-byte character encoding
1110xxxx  0xE0..0xEF   First byte of a 3-byte character encoding
11110xxx  0xF0..0xF7   First byte of a 4-byte character encoding

So the quick answer is: in UTF-8 a character (code point) takes 1 to 4 bytes, and the first byte indicates how many.
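To make the table concrete, here is a minimal Python sketch of that rule. The function name `utf8_sequence_length` is made up for this example, and it only looks at the bit patterns from the table, so it is not a full UTF-8 validator.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return how many bytes a UTF-8 sequence has, judging by its first byte.

    Follows the bit patterns in the table above. It does not validate the
    sequence: bytes such as 0xC0, 0xC1 and 0xF5..0xF7 match a pattern here
    but never occur in well-formed UTF-8.
    """
    if 0x00 <= first_byte <= 0x7F:   # 0xxxxxxx
        return 1
    if 0xC0 <= first_byte <= 0xDF:   # 110xxxxx
        return 2
    if 0xE0 <= first_byte <= 0xEF:   # 1110xxxx
        return 3
    if 0xF0 <= first_byte <= 0xF7:   # 11110xxx
        return 4
    # 10xxxxxx is a continuation byte; anything else is not a valid lead byte.
    raise ValueError(f"0x{first_byte:02X} cannot start a UTF-8 sequence")


# Compare the prediction with what Python's own encoder produces.
for ch in ["a", "\u00a9", "\u20ac", "\U0001F4A9"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", utf8_sequence_length(encoded[0]), len(encoded))
```

For each sample the two numbers printed should agree, since the first byte alone determines the length.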

Solution 2:

You won't see a simple answer because there isn't one.

First, Unicode doesn't contain "every character from every language", although it sure does try.

Unicode itself is a mapping: it defines codepoints, and a codepoint is a number that is usually associated with a character. I say usually because there are concepts like combining characters. You may be familiar with things like accents or umlauts. Those can be used with another character, such as an a or a u, to create a new logical character. A character can therefore consist of one or more codepoints.

To be useful in computing systems we need to choose a representation for this information. Those are the various Unicode encodings, such as UTF-8, UTF-16LE, UTF-32 etc. They are distinguished largely by the size of their codeunits. UTF-32 is the simplest encoding: it has a codeunit that is 32 bits, which means an individual codepoint fits comfortably into a codeunit. The other encodings will have situations where a codepoint needs multiple codeunits, or where a particular codepoint can't be represented in the encoding at all (this is a problem, for instance, with UCS-2).
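A small Python sketch of the codeunit idea, assuming the usual sizes of 1, 2 and 4 bytes per codeunit for UTF-8, UTF-16 and UTF-32. The helper name `code_units` is invented for the example, and the `-le` codecs are used so that no byte order mark gets counted.

```python
def code_units(text: str, encoding: str, unit_size: int) -> int:
    """Number of codeunits needed for text, given the codeunit size in bytes."""
    return len(text.encode(encoding)) // unit_size


samples = {
    "LATIN SMALL LETTER A": "a",
    "EURO SIGN": "\u20ac",
    "PILE OF POO": "\U0001F4A9",
}

for name, ch in samples.items():
    print(
        name,
        code_units(ch, "utf-8", 1),      # 8-bit codeunits
        code_units(ch, "utf-16-le", 2),  # 16-bit codeunits
        code_units(ch, "utf-32-le", 4),  # 32-bit codeunits
    )
```

UTF-32 always needs exactly one codeunit per codepoint, while the codepoint above U+FFFF needs two UTF-16 codeunits (a surrogate pair) and four UTF-8 codeunits.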

Because of the flexibility of combining characters, even within a given encoding the number of bytes per character can vary depending on the character and the normalization form. Normalization is a protocol for dealing with characters that have more than one representation (you can say "an 'a' with an accent", which is 2 codepoints, one of which is a combining char, or "accented 'a'", which is one codepoint).
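Here is a short Python illustration of the accented 'a' case, using the standard `unicodedata` module. The choice of U+00E1 and U+0301 as the concrete characters is mine; any precomposed/combining pair would behave the same way.

```python
import unicodedata

composed = "\u00e1"      # LATIN SMALL LETTER A WITH ACUTE, one codepoint
decomposed = "a\u0301"   # 'a' followed by COMBINING ACUTE ACCENT, two codepoints

# Both render as the same logical character, but they differ in length.
print(len(composed), len(composed.encode("utf-8")))      # codepoints, UTF-8 bytes
print(len(decomposed), len(decomposed.encode("utf-8")))

# Normalization converts between the two representations.
nfc = unicodedata.normalize("NFC", decomposed)  # compose where possible
nfd = unicodedata.normalize("NFD", composed)    # decompose fully
print(nfc == composed, nfd == decomposed)
```

So the same visible character takes 2 bytes in UTF-8 in its composed form and 3 bytes in its decomposed form.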

Solution 3:

I know this question is old and already has an accepted answer, but I want to offer a few examples (hoping it'll be useful to someone).

As far as I know old ASCII characters took one byte per character.

Right. Actually, since ASCII is a 7-bit encoding, it supports 128 codes (95 of which are printable), so it only uses half of the 256 values a byte can hold (if that makes any sense).

How many bytes does a Unicode character require?

Unicode just maps characters to codepoints. It doesn't define how to encode them. A text file does not contain Unicode characters, but bytes/octets that may represent Unicode characters.
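A quick Python sketch of that last point: the same three bytes turn into different characters depending on which encoding you assume when reading them. The byte values are just an example I picked.

```python
raw = bytes([0xE2, 0x82, 0xAC])  # three bytes, as they might sit in a file

as_utf8 = raw.decode("utf-8")      # read as UTF-8: a single character, U+20AC
as_latin1 = raw.decode("latin-1")  # read as Latin-1: one character per byte

print([f"U+{ord(ch):04X}" for ch in as_utf8])
print([f"U+{ord(ch):04X}" for ch in as_latin1])
```

The bytes themselves carry no label saying which reading is the intended one, which is why the encoding has to be known or declared separately.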

I assume that one Unicode character can contain every possible character from any language - am I correct?

No. But almost. So basically yes. But still no.

So how many bytes does it need per character?

Same as your 2nd question.

And what do UTF-7, UTF-6, UTF-16 etc mean? Are they some kind of Unicode versions?

No, those are encodings. They define how bytes/octets should represent Unicode characters.

Here are a couple of examples (the UTF-16 bytes are shown in big-endian order). If some of them cannot be displayed in your browser (probably because the font doesn't support them), go to http://codepoints.net/U+1F6AA (replace 1F6AA with the codepoint in hex) to see an image.

- U+0061 LATIN SMALL LETTER A: a
  - Nº: 97
  - UTF-8: 61
  - UTF-16: 00 61
- U+00A9 COPYRIGHT SIGN: ©
  - Nº: 169
  - UTF-8: C2 A9
  - UTF-16: 00 A9
- U+00AE REGISTERED SIGN: ®
  - Nº: 174
  - UTF-8: C2 AE
  - UTF-16: 00 AE
- U+1337 ETHIOPIC SYLLABLE PHWA: ጷ
  - Nº: 4919
  - UTF-8: E1 8C B7
  - UTF-16: 13 37
- U+2014 EM DASH: —
  - Nº: 8212
  - UTF-8: E2 80 94
  - UTF-16: 20 14
- U+2030 PER MILLE SIGN: ‰
  - Nº: 8240
  - UTF-8: E2 80 B0
  - UTF-16: 20 30
- U+20AC EURO SIGN: €
  - Nº: 8364
  - UTF-8: E2 82 AC
  - UTF-16: 20 AC
- U+2122 TRADE MARK SIGN: ™
  - Nº: 8482
  - UTF-8: E2 84 A2
  - UTF-16: 21 22
- U+2603 SNOWMAN: ☃
  - Nº: 9731
  - UTF-8: E2 98 83
  - UTF-16: 26 03
- U+260E BLACK TELEPHONE: ☎
  - Nº: 9742
  - UTF-8: E2 98 8E
  - UTF-16: 26 0E
- U+2614 UMBRELLA WITH RAIN DROPS: ☔
  - Nº: 9748
  - UTF-8: E2 98 94
  - UTF-16: 26 14
- U+263A WHITE SMILING FACE: ☺
  - Nº: 9786
  - UTF-8: E2 98 BA
  - UTF-16: 26 3A
- U+2691 BLACK FLAG: ⚑
  - Nº: 9873
  - UTF-8: E2 9A 91
  - UTF-16: 26 91
- U+269B ATOM SYMBOL: ⚛
  - Nº: 9883
  - UTF-8: E2 9A 9B
  - UTF-16: 26 9B
- U+2708 AIRPLANE: ✈
  - Nº: 9992
  - UTF-8: E2 9C 88
  - UTF-16: 27 08
- U+271E SHADOWED WHITE LATIN CROSS: ✞
  - Nº: 10014
  - UTF-8: E2 9C 9E
  - UTF-16: 27 1E
- U+3020 POSTAL MARK FACE: 〠
  - Nº: 12320
  - UTF-8: E3 80 A0
  - UTF-16: 30 20
- U+8089 CJK UNIFIED IDEOGRAPH-8089: 肉
  - Nº: 32905
  - UTF-8: E8 82 89
  - UTF-16: 80 89
- U+1F4A9 PILE OF POO: 💩
  - Nº: 128169
  - UTF-8: F0 9F 92 A9
  - UTF-16: D8 3D DC A9
- U+1F680 ROCKET: 🚀
  - Nº: 128640
  - UTF-8: F0 9F 9A 80
  - UTF-16: D8 3D DE 80

Okay I'm getting carried away...
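If you want to produce entries like the ones above for other characters, something along these lines should work in Python 3.8+ (`bytes.hex` with a separator needs that version). The function name `describe` and the dictionary keys are made up for the example, and `utf-16-be` is used because the list shows big-endian bytes without a byte order mark.

```python
def describe(ch: str) -> dict:
    """Codepoint, decimal number and encoded bytes for a single character."""
    return {
        "codepoint": f"U+{ord(ch):04X}",
        "number": ord(ch),
        "utf8": ch.encode("utf-8").hex(" ").upper(),
        "utf16": ch.encode("utf-16-be").hex(" ").upper(),
    }


for ch in ["a", "\u20ac", "\u2603", "\U0001F4A9"]:
    print(describe(ch))
```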

Fun facts:

- If you're looking for a specific character, you can copy and paste it into http://codepoints.net/.
- I wasted a lot of time on this useless list (but it's sorted!).
- MySQL has a charset called "utf8" which actually does not support characters that take more than 3 bytes. So you can't insert a pile of poo: depending on the SQL mode, the insert fails or the value is truncated at that character. Use "utf8mb4" instead.
- There's a snowman test page (unicodesnowmanforyou.com).
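Regarding the MySQL point: if you want to check up front whether a string contains characters that take 4 bytes in UTF-8 (that is, codepoints above U+FFFF), a rough Python sketch could look like this. The function name `needs_utf8mb4` is hypothetical and the code does not talk to MySQL at all; it only inspects codepoints.

```python
def needs_utf8mb4(text: str) -> bool:
    """True if any codepoint in text lies above U+FFFF (4 bytes in UTF-8)."""
    return any(ord(ch) > 0xFFFF for ch in text)


print(needs_utf8mb4("snowman \u2603"))          # 3-byte character only
print(needs_utf8mb4("pile of poo \U0001F4A9"))  # contains a 4-byte character
```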
