What do I need to know about Unicode?

Solution 1:

Unicode is a standard that assigns numeric codes (code points) to the characters used in written communication. Or, as the Unicode Consortium itself puts it:

The standard for digital representation of the characters used in writing all of the world's languages. Unicode provides a uniform means for storing, searching, and interchanging text in any language. It is used by all modern computers and is the foundation for processing text on the Internet. Unicode is developed and maintained by the Unicode Consortium.

There are many common, yet easily avoided, programming errors committed by developers who don't bother to educate themselves about Unicode and its encodings.

  • First, go to the source for authoritative, detailed information and implementation guidelines.
  • As mentioned by others, Joel Spolsky has a good list of these errors.
  • I also like Elliotte Rusty Harold's Ten Commandments of Unicode.
  • Developers should also watch out for canonical representation attacks.
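To make the last point a bit more concrete: one well-known form of canonical representation attack is an "overlong" UTF-8 sequence, where a character such as "/" is smuggled past a filter as two bytes instead of one. Here is a minimal sketch, assuming Python 3 and its built-in strict UTF-8 decoder; the helper name `decode_strict` is made up for illustration.

```python
from typing import Optional


def decode_strict(raw: bytes) -> Optional[str]:
    """Decode UTF-8, returning None for input that is not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


canonical = b"/"        # the one valid UTF-8 encoding of U+002F
overlong = b"\xc0\xaf"  # non-canonical two-byte form of the same character

print(decode_strict(canonical))  # "/"
print(decode_strict(overlong))   # None: a strict decoder refuses it
```

The takeaway is to decode and validate first, then apply security checks to the decoded text, rather than pattern-matching on raw bytes.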

Some of the key concepts you should be aware of are:

  • Glyphs: concrete graphics used to render written characters. Unicode encodes abstract characters; fonts supply the glyphs.
  • Composition: combining characters (for example, a base letter and an accent) to form another character.
  • Encoding: converting Unicode code points to a stream of bytes.
  • Collation: locale-sensitive comparison of Unicode strings.
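Two of the concepts above, composition and encoding, are easy to see in a few lines. This is only a sketch, assuming Python 3 and its standard `unicodedata` module; the variable names are my own.

```python
import unicodedata

# Composition: "é" can be a single code point, or "e" plus a combining accent.
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# The two look the same on screen but compare unequal until normalized.
same_without_normalizing = precomposed == decomposed
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed

# Encoding: the same code point becomes different bytes in different encodings.
utf8_bytes = precomposed.encode("utf-8")
utf16_bytes = precomposed.encode("utf-16-le")

print(same_without_normalizing, same_after_nfc)  # False True
print(utf8_bytes, utf16_bytes)                   # b'\xc3\xa9' b'\xe9\x00'
```

Normalizing to one form (NFC here) before comparing or storing strings avoids a whole class of "these look equal but aren't" bugs.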

Solution 2:

At the risk of just adding another link, unicode.org is a spectacular resource.

In short, it's a replacement for ASCII that's designed to handle, as far as practical, every character used in human writing. Unicode has several encoding schemes to handle all those characters. UTF-8, which is more or less the standard these days, is a variable-width encoding that uses one to four bytes per character and is identical to ASCII for the first 128 code points (the 7-bit range).
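If you want to see the byte widths for yourself, here is a rough sketch, assuming Python 3; the sample characters and the helper `utf8_length` are just picked for illustration.

```python
def utf8_length(ch: str) -> int:
    """Return how many bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


# One sample each from the 1-, 2-, 3-, and 4-byte ranges.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
widths = {f"U+{ord(ch):04X}": utf8_length(ch) for ch in samples}

# Pure ASCII text produces the same bytes under both encodings.
ascii_compatible = "Hello".encode("utf-8") == "Hello".encode("ascii")

print(widths)            # {'U+0041': 1, 'U+00E9': 2, 'U+20AC': 3, 'U+1F600': 4}
print(ascii_compatible)  # True
```

This is also why you can't assume that the number of bytes equals the number of characters once text leaves the ASCII range.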

(As an addendum, there's a popular misconception amongst programmers that you only need to know about Unicode if you're going to be doing internationalization. While that's certainly one use, it's not the only one. For example, I'm working on a project that will only ever use English text - but with a huge number of fancy math symbols. Moving the whole project over to be fully Unicode solved more problems than I can count.)

Solution 3:

Unicode is an industry-agreed standard for consistently representing text, with the capacity to represent the world's writing systems. All developers need to know about it, as globalization is a growing concern.

Solution 4:

One (open) source of code for handling Unicode is ICU, the International Components for Unicode. It includes ICU4J for Java and ICU4C for C and C++ (presents C interface; uses C++ compiler).