Isn’t UTF-8's byte order different on big-endian machines than on little-endian machines? So why doesn’t UTF-8 require a BOM?
UTF-8 can contain a BOM. However, it makes no difference as to the endianness of the byte stream. UTF-8 always has the same byte order.
If UTF-8 stored all code points in a single byte, it would make sense why endianness doesn’t play any role and thus why a BOM isn’t required. But since code points 128 and above are stored using 2, 3, or 4 bytes, doesn’t that mean their byte order on big-endian machines is different than on little-endian machines? How, then, can we claim UTF-8 always has the same byte order?
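For example, a quick Python check (just my own illustration of the premise, with a few arbitrarily chosen characters) shows how many bytes each code point takes in UTF-8:

    for ch in ("A", "é", "€", "😀"):
        encoded = ch.encode("utf-8")
        print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")
    # U+0041  -> 1 byte(s):  41
    # U+00E9  -> 2 byte(s):  c3 a9
    # U+20AC  -> 3 byte(s):  e2 82 ac
    # U+1F600 -> 4 byte(s):  f0 9f 98 80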
Thank you
EDIT:
UTF-8 is byte oriented
I understand that if a two-byte UTF-8 character C consists of bytes B1 and B2 (where B1 is the first byte and B2 is the last byte), then with UTF-8 those two bytes are always written in the same order (thus if this character is written to a file on a little-endian machine LEM, B1 will be first and B2 last; similarly, if C is written to a file on a big-endian machine BEM, B1 will still be first and B2 still last).
But what happens when C is written to file F on LEM, and we then copy F to BEM and try to read it there? Since BEM automatically swaps bytes (B1 is now the last byte and B2 the first), how will an app (running on BEM) reading F know whether F was created on BEM, so the order of the two bytes wasn’t swapped, or whether F was transferred from LEM, in which case BEM automatically swapped the bytes?
I hope the question makes some sense
EDIT 2:
In response to your edit: big-endian machines do not swap bytes if you ask them to read a byte at a time.
a) Oh, so even though character C is 2 bytes long, an app (residing on BEM) reading F will read into memory just one byte at a time (thus it will first read B1 into memory and only then B2)?
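If I understand that correctly, something like this Python sketch (the file name F.txt is just a placeholder I made up) would behave the same way on LEM and on BEM:

    # Write a hypothetical two-byte character C ("é" here) to a file F.
    with open("F.txt", "wb") as f:
        f.write("é".encode("utf-8"))   # writes B1 = 0xC3, then B2 = 0xA9

    # Read the file back one byte at a time; a single byte has no
    # internal byte order, so we get B1 then B2 on any machine.
    with open("F.txt", "rb") as f:
        while (b := f.read(1)):
            print(b.hex())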
b)
In UTF-8, you decide what to do with a byte based on its high-order bits
Assuming file F has two consecutive characters C and C1 (where C consists of bytes B1 and B2, while C1 has bytes B3, B4 and B5), how will an app reading F know which bytes belong together simply by checking each byte's high-order bits (for example, how will it figure out that B1 and B2 taken together should represent a character, and not B1, B2 and B3)?
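To make this concrete, here are the bytes of a hypothetical C = "é" (two bytes) followed by C1 = "€" (three bytes), printed in binary so the high-order bits are visible:

    # C = "é" -> B1, B2; C1 = "€" -> B3, B4, B5
    data = "é€".encode("utf-8")
    for b in data:
        print(f"{b:08b}")
    # 11000011  (B1: starts with 110  -> lead byte of a 2-byte sequence)
    # 10101001  (B2: starts with 10   -> continuation byte)
    # 11100010  (B3: starts with 1110 -> lead byte of a 3-byte sequence)
    # 10000010  (B4: continuation byte)
    # 10101100  (B5: continuation byte)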
If you believe that you're seeing something different, please edit your question and include
I’m not saying that. I simply didn’t understand what was going on.
c) Why aren't UTF-16 and UTF-32 also byte-oriented?
Solution 1:
The byte order is different on big-endian vs little-endian machines for words/integers larger than a byte.
e.g. on a big-endian machine a short integer of 2 bytes stores the 8 most significant bits in the first byte and the 8 least significant bits in the second byte. On a little-endian machine the 8 most significant bits will be in the second byte and the 8 least significant bits in the first byte.
So, if you write the memory content of such a short int directly to a file/network, the byte ordering within the short int will be different depending on the endianness.
UTF-8 is byte-oriented, so there is no issue regarding endianness: the first byte is always the first byte, the second byte is always the second byte, etc., regardless of endianness.
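A short Python sketch of that difference (just an illustration, using the struct module to force each byte order explicitly):

    import struct

    # A 2-byte short int written raw depends on the machine's byte order:
    n = 0x1234
    print(struct.pack(">H", n).hex(" "))  # big-endian:    12 34
    print(struct.pack("<H", n).hex(" "))  # little-endian: 34 12

    # UTF-8 output is defined as a sequence of bytes, so it is the same
    # no matter which machine produced it:
    print("é".encode("utf-8").hex(" "))   # c3 a9 everywhere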
Solution 2:
To answer c): UTF-16 and UTF-32 represent characters as 16-bit or 32-bit words, so they are not byte-oriented.
For UTF-8, the smallest unit is a byte, thus it is byte-oriented. The algorithm reads or writes one byte at a time. A byte is represented the same way on all machines.
For UTF-16, the smallest unit is a 16-bit word, and for UTF-32, the smallest unit is a 32-bit word. The algorithm reads or writes one word at a time (2 bytes, or 4 bytes). The order of the bytes in each word is different on big-endian and little-endian machines.
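This is easy to see by encoding the same character with Python's standard codecs (a minimal sketch; the explicit -le/-be codecs just make the two word orders visible):

    s = "é"  # U+00E9
    print(s.encode("utf-8").hex(" "))      # c3 a9        -> same on every machine
    print(s.encode("utf-16-le").hex(" "))  # e9 00        -> little-endian 16-bit word
    print(s.encode("utf-16-be").hex(" "))  # 00 e9        -> big-endian 16-bit word
    print(s.encode("utf-32-le").hex(" "))  # e9 00 00 00  -> little-endian 32-bit word
    print(s.encode("utf-32-be").hex(" "))  # 00 00 00 e9  -> big-endian 32-bit word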