Why do English characters require fewer bytes to represent than characters from other alphabets?

When I put 'a' in a text file, it makes it 2 bytes, but when I put, let's say 'ա', which is a letter from the Armenian alphabet, it makes it 3 bytes.

What is the difference between alphabets for a computer?
Why does English take less space?


One of the first encoding schemes developed for use in mainstream computers was ASCII (American Standard Code for Information Interchange). It was developed in the 1960s in the United States.

The English alphabet uses only part of the Latin alphabet (for instance, accented words are rare in English). There are 26 individual letters in that alphabet, not counting case, and any scheme that aims to encode the English alphabet must also include digits and punctuation marks.

The 1960s were also a time when computers didn't have the amount of memory or disk space that we have now. ASCII was developed to be a standard representation of a functional alphabet across all American computers. At the time, the decision to make every ASCII character 8 bits (1 byte) long was made due to technical details of the era (the Wikipedia article mentions the fact that perforated tape held 8 bits in a position at a time). In fact, the original ASCII scheme can be transmitted using 7 bits, and the eighth could be used for parity checks. Later developments expanded the original ASCII scheme to include several accented, mathematical and terminal characters.

With the recent increase of computer usage across the world, more and more speakers of different languages gained access to computers. That meant that, for each language, new encoding schemes had to be developed independently of the others, and these schemes would conflict when text was read on terminals set up for a different language.

Unicode came as a solution to this fragmentation, merging all possible meaningful characters into a single abstract character set.

UTF-8 is one way to encode the Unicode character set. It is a variable-width encoding (i.e. different characters can have different sizes), and it was designed for backwards compatibility with the former ASCII scheme. As such, the ASCII character set remains one byte in size, while all other characters take two or more bytes. UTF-16 is another way to encode the Unicode character set. In contrast to UTF-8, it encodes characters as either one or two 16-bit code units.
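As a rough illustration (assuming Python 3, not something from the question itself), you can compare how the two encodings size the characters discussed here:

    # Compare UTF-8 and UTF-16 sizes for the two characters from the question.
    for ch in ("a", "ա"):
        utf8 = ch.encode("utf-8")
        utf16 = ch.encode("utf-16-le")  # little-endian, no byte-order mark
        print(ch, "UTF-8:", len(utf8), "byte(s)", "UTF-16:", len(utf16), "byte(s)")

    # a UTF-8: 1 byte(s) UTF-16: 2 byte(s)
    # ա UTF-8: 2 byte(s) UTF-16: 2 byte(s)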

As stated in the comments, the 'a' character occupies a single byte while 'ա' occupies two bytes, which indicates a UTF-8 encoding. The extra byte in your question was due to a newline character at the end (which the OP found out about).
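If you want to verify this yourself, a small sketch (again assuming Python 3) reproduces the sizes from the question once the trailing newline is taken into account:

    # The "extra" byte in each file is the newline the editor appends.
    print(len("a".encode("utf-8")))    # 1 -> the letter alone
    print(len("a\n".encode("utf-8")))  # 2 -> letter plus newline, as in the file
    print(len("ա".encode("utf-8")))    # 2 -> the Armenian letter alone
    print(len("ա\n".encode("utf-8")))  # 3 -> letter plus newline, as in the file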


1 byte is 8 bits, and can thus represent up to 256 (2^8) different values.

For languages that require more possibilities than this, a simple 1 to 1 mapping can't be maintained, so more data is needed to store a character.

Note that, generally, most encodings use the first 7 bits (128 values) for ASCII characters. That leaves the 8th bit, i.e. 128 more values, for other characters. Add in accented characters, Asian languages, Cyrillic, etc., and you can easily see why 1 byte is not sufficient for keeping all characters.
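A rough way to see this (sketched in Python 3) is to look at each character's Unicode code point: anything above 255 simply cannot fit in a single byte.

    # Code points above 255 cannot be stored in a single byte.
    for ch in ("a", "ñ", "ա", "অ"):
        print(ch, "code point:", ord(ch), "fits in one byte:", ord(ch) <= 255)

    # a code point: 97 fits in one byte: True
    # ñ code point: 241 fits in one byte: True
    # ա code point: 1377 fits in one byte: False
    # অ code point: 2437 fits in one byte: False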


In UTF-8, ASCII characters use one byte; other characters use two, three, or four bytes.
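A quick illustration (assuming Python 3), with one example character from each tier:

    # One sample character per UTF-8 length tier.
    for ch in ("a", "é", "中", "😀"):
        print(ch, len(ch.encode("utf-8")), "byte(s)")

    # a 1 byte(s)
    # é 2 byte(s)
    # 中 3 byte(s)
    # 😀 4 byte(s)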


The number of bytes required for a character (which the question is apparently about) depends on the character encoding. If you use the ArmSCII encoding, each Armenian letter occupies just one byte. It’s not a good choice these days, though.

In the UTF-8 transfer encoding for Unicode, characters need different numbers of bytes. In it, “a” takes just one byte (the idea about two bytes is some kind of confusion), “á” takes two bytes, and the Armenian letter ayb “ա” takes two bytes too; three bytes must also be some kind of confusion. In contrast, the Bengali letter a “অ”, for example, takes three bytes in UTF-8.

The background is simply that UTF-8 was designed to be very efficient for ASCII characters, fairly efficient for the writing systems of Europe and its surroundings, and less efficient for all the rest. This means that for basic Latin letters (which is what English text mostly consists of), only one byte is needed per character; for Greek, Cyrillic, Armenian, and a few others, two bytes are needed; all the rest needs more.
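To see how these tiers come about, it can help to inspect the raw bits; here is a small Python 3 sketch (the 0xxxxxxx / 110xxxxx / 10xxxxxx patterns are part of the UTF-8 design):

    # One-byte characters start with a 0 bit; a two-byte character has a
    # lead byte starting with 110 and a continuation byte starting with 10.
    for ch in ("a", "ա"):
        print(ch, [format(b, "08b") for b in ch.encode("utf-8")])

    # a ['01100001']
    # ա ['11010101', '10100001']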

UTF-8 also has (as pointed out in a comment) the useful property that ASCII data (when represented as 8-bit units, which has been almost the only way for a long time) is trivially valid UTF-8, too.


Character codes in the 1960s (and long beyond) were machine-specific. In the 1980s I briefly used a DEC 2020 machine, which had 36-bit words and (IIRC) 5-, 6- and 8-bit-per-character encodings. Before that, I used an IBM 370 series machine with EBCDIC. ASCII with 7 bits brought order, but things got messy again with the IBM PC "codepages", which used all 8 bits to represent extra characters, like all sorts of box-drawing ones to paint primitive menus, and with later ASCII extensions: 8-bit encodings whose first 128 values match ASCII and whose other half holds "national characters" like ñ, Ç, or others. Probably the most popular of these was Latin-1, tailored to English and most European languages that use Latin characters (with their accents and variants).

Writing text that mixed, say, English and Spanish went fine (just use Latin-1, a superset of both), but mixing anything that used different encodings (say, including a snippet of Greek or Russian, not to mention an Asian language like Japanese) was a veritable nightmare. Worst of all, Russian and particularly Japanese and Chinese had several popular, completely incompatible encodings.

Today we use Unicode, which is coupled with efficient encodings like UTF-8 that favor English characters (surprisingly, the encoding of English letters just so happens to correspond to ASCII), thus making many non-English characters use longer encodings.