What's the difference between ASCII and Unicode?

What's the exact difference between Unicode and ASCII?

ASCII has a total of 128 characters (256 in the extended set).

Is there any size specification for Unicode characters?


Solution 1:

ASCII defines 128 characters, which map to the numbers 0–127. Unicode defines (less than) 221 characters, which, similarly, map to numbers 0–221 (though not all numbers are currently assigned, and some are reserved).

Unicode is a superset of ASCII, and the numbers 0–127 have the same meaning in ASCII as they have in Unicode. For example, the number 65 means "Latin capital 'A'".

Because Unicode characters don't generally fit into one 8-bit byte, there are numerous ways of storing Unicode characters in byte sequences, such as UTF-32 and UTF-8.

Solution 2:

Understanding why ASCII and Unicode were created in the first place helped me understand the differences between the two.

ASCII, Origins

As stated in the other answers, ASCII uses 7 bits to represent a character. By using 7 bits, we can have a maximum of 2^7 (= 128) distinct combinations*. Which means that we can represent 128 characters maximum.

Wait, 7 bits? But why not 1 byte (8 bits)?

The last bit (8th) is used for avoiding errors as parity bit. This was relevant years ago.

Most ASCII characters are printable characters of the alphabet such as abc, ABC, 123, ?&!, etc. The others are control characters such as carriage return, line feed, tab, etc.

See below the binary representation of a few characters in ASCII:

0100101 -> % (Percent Sign - 37)
1000001 -> A (Capital letter A - 65)
1000010 -> B (Capital letter B - 66)
1000011 -> C (Capital letter C - 67)
0001101 -> Carriage Return (13)

See the full ASCII table over here.

ASCII was meant for English only.

What? Why English only? So many languages out there!

Because the center of the computer industry was in the USA at that time. As a consequence, they didn't need to support accents or other marks such as á, ü, ç, ñ, etc. (aka diacritics).

ASCII Extended

Some clever people started using the 8th bit (the bit used for parity) to encode more characters to support their language (to support "é", in French, for example). Just using one extra bit doubled the size of the original ASCII table to map up to 256 characters (2^8 = 256 characters). And not 2^7 as before (128).

10000010 -> é (e with acute accent - 130)
10100000 -> á (a with acute accent - 160)

The name for this "ASCII extended to 8 bits and not 7 bits as before" could be just referred as "extended ASCII" or "8-bit ASCII".

As @Tom pointed out in his comment below there is no such thing as "extended ASCII" yet this is an easy way to refer to this 8th-bit trick. There are many variations of the 8-bit ASCII table, for example, the ISO 8859-1, also called ISO Latin-1.

Unicode, The Rise

ASCII Extended solves the problem for languages that are based on the Latin alphabet... what about the others needing a completely different alphabet? Greek? Russian? Chinese and the likes?

We would have needed an entirely new character set... that's the rational behind Unicode. Unicode doesn't contain every character from every language, but it sure contains a gigantic amount of characters (see this table).

You cannot save text to your hard drive as "Unicode". Unicode is an abstract representation of the text. You need to "encode" this abstract representation. That's where an encoding comes into play.

Encodings: UTF-8 vs UTF-16 vs UTF-32

This answer does a pretty good job at explaining the basics:

  • UTF-8 and UTF-16 are variable length encodings.
  • In UTF-8, a character may occupy a minimum of 8 bits.
  • In UTF-16, a character length starts with 16 bits.
  • UTF-32 is a fixed length encoding of 32 bits.

UTF-8 uses the ASCII set for the first 128 characters. That's handy because it means ASCII text is also valid in UTF-8.

Mnemonics:

  • UTF-8: minimum 8 bits.
  • UTF-16: minimum 16 bits.
  • UTF-32: minimum and maximum 32 bits.

Note:

Why 2^7?

This is obvious for some, but just in case. We have seven slots available filled with either 0 or 1 (Binary Code). Each can have two combinations. If we have seven spots, we have 2 * 2 * 2 * 2 * 2 * 2 * 2 = 2^7 = 128 combinations. Think about this as a combination lock with seven wheels, each wheel having two numbers only.

Source: Wikipedia, this great blog post and Mocki.co where I initially posted this summary.

Solution 3:

ASCII has 128 code points, 0 through 127. It can fit in a single 8-bit byte, the values 128 through 255 tended to be used for other characters. With incompatible choices, causing the code page disaster. Text encoded in one code page cannot be read correctly by a program that assumes or guessed at another code page.

Unicode came about to solve this disaster. Version 1 started out with 65536 code points, commonly encoded in 16 bits. Later extended in version 2 to 1.1 million code points. The current version is 6.3, using 110,187 of the available 1.1 million code points. That doesn't fit in 16 bits anymore.

Encoding in 16-bits was common when v2 came around, used by Microsoft and Apple operating systems for example. And language runtimes like Java. The v2 spec came up with a way to map those 1.1 million code points into 16-bits. An encoding called UTF-16, a variable length encoding where one code point can take either 2 or 4 bytes. The original v1 code points take 2 bytes, added ones take 4.

Another variable length encoding that's very common, used in *nix operating systems and tools is UTF-8, a code point can take between 1 and 4 bytes, the original ASCII codes take 1 byte the rest take more. The only non-variable length encoding is UTF-32, takes 4 bytes for a code point. Not often used since it is pretty wasteful. There are other ones, like UTF-1 and UTF-7, widely ignored.

An issue with the UTF-16/32 encodings is that the order of the bytes will depend on the endian-ness of the machine that created the text stream. So add to the mix UTF-16BE, UTF-16LE, UTF-32BE and UTF-32LE.

Having these different encoding choices brings back the code page disaster to some degree, along with heated debates among programmers which UTF choice is "best". Their association with operating system defaults pretty much draws the lines. One counter-measure is the definition of a BOM, the Byte Order Mark, a special codepoint (U+FEFF, zero width space) at the beginning of a text stream that indicates how the rest of the stream is encoded. It indicates both the UTF encoding and the endianess and is neutral to a text rendering engine. Unfortunately it is optional and many programmers claim their right to omit it so accidents are still pretty common.