What character encoding should I use for a web page containing mostly Arabic text? Is utf-8 okay?

What character encoding should I use for a web page containing mostly Arabic text?

Is utf-8 okay?


Solution 1:

UTF-8 can store the full Unicode range, so it's fine to use for Arabic.
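
As a quick sanity check of that claim, here is a minimal Python sketch that round-trips text mixing Arabic with other scripts through UTF-8. The sample strings are my own choice for illustration, not something from the question.

```python
# Sample pieces (assumed for illustration): Arabic, Latin, CJK and an emoji.
parts = ["مرحبا", "hello", "你好", "😀"]
sample = " ".join(parts)

encoded = sample.encode("utf-8")   # str -> bytes
decoded = encoded.decode("utf-8")  # bytes -> str

# UTF-8 is lossless for any Unicode text, so the round trip returns the input.
print(decoded == sample)
print(len(sample), "code points ->", len(encoded), "bytes")
```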


However, if you were wondering what encoding would be most efficient:

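Almost all Arabic characters can be encoded using a single UTF-16 code unit (2 bytes). In UTF-8 they take either 2 or 3 code units (1 byte each): the everyday letters in the basic Arabic block (U+0600 to U+06FF) take 2 bytes, while the presentation forms (U+FB50 to U+FDFF and U+FE70 to U+FEFF) take 3. So if you were just encoding Arabic, UTF-16 would be at best a slightly more space efficient option.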
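
To make the per-character sizes concrete, the following sketch counts the bytes for one letter from the basic Arabic block and one presentation-form ligature. The two sample characters and the helper name `byte_lengths` are my own picks for illustration.

```python
def byte_lengths(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    # "utf-16-le" is used so that no byte order mark is added to the count.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

# Sample characters (assumed for illustration).
samples = [
    ("U+0628 ARABIC LETTER BEH", "\u0628"),
    ("U+FEFB ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM", "\uFEFB"),
]

for name, ch in samples:
    utf8, utf16 = byte_lengths(ch)
    print(f"{name}: UTF-8 = {utf8} bytes, UTF-16 = {utf16} bytes")
```

The basic letter costs 2 bytes in both encodings, and only the presentation form costs an extra byte in UTF-8.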

However, you're not just encoding Arabic: a web page also contains a significant number of characters that take a single byte in UTF-8 but two bytes in UTF-16, namely the HTML syntax characters (<, &, >, =) and all the HTML element and attribute names.
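
A rough way to see the effect of markup is to measure a small fragment in both encodings. This is only a sketch: the HTML fragment and the helper name `sizes` are made up for illustration, and real pages will have a different ratio of markup to text.

```python
def sizes(text):
    """Return (UTF-8 size, UTF-16 size) in bytes, without a byte order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

# Hypothetical fragment: a short Arabic word wrapped in ordinary markup.
arabic_only = "مرحبا"
html = '<p class="intro"><a href="/news">' + arabic_only + "</a></p>"

print("Arabic only:", sizes(arabic_only))
print("With markup:", sizes(html))
```

For the bare word the two encodings tie, but once the ASCII markup is included UTF-8 comes out clearly smaller.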

It's a trade-off and, unless you're dealing with huge documents, it doesn't matter.

Solution 2:

I develop mostly Arabic websites, and these are the two encodings I use:

1. Windows-1256

This is the most common encoding Arabic websites use. It works in most cases (about 90%, in my experience) for Arabic users.

Here is one of the biggest Arabic web-development forums: http://traidnt.net/vb/. You can see that they are using this encoding.

The problem is that if you are developing a website for international use, this encoding won't work for every user: some of them will see gibberish instead of the content.
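
The gibberish can be reproduced in a few lines. The sketch below assumes one possible cause, a reader whose software falls back to a Western legacy encoding (Latin-1 here), and it also shows that Windows-1256 cannot represent text from unrelated scripts. The sample strings are my own.

```python
word = "سلام"

# Bytes as a Windows-1256 page would send them ("cp1256" is Python's codec name).
raw = word.encode("cp1256")

# A reader whose software assumes Latin-1 gets accented Latin letters instead.
garbled = raw.decode("latin-1")
print(garbled)

# Text outside the Windows-1256 repertoire cannot be encoded at all.
try:
    "日本語".encode("cp1256")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print("Japanese encodable in cp1256:", encodable)
```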

2. UTF-8

This encoding solves the previous problem and also works in URLs: if you want to have Arabic words in your URL, they need to be in UTF-8 or it won't work.
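
To illustrate the URL point, here is a small sketch using Python's standard `urllib.parse`. It percent-encodes the same word as UTF-8 and as Windows-1256, then decodes both the way a UTF-8-expecting consumer would. The sample word is my own.

```python
from urllib.parse import quote, unquote

word = "سلام"

utf8_path = quote(word)                       # quote() defaults to UTF-8
legacy_path = quote(word, encoding="cp1256")  # same word as Windows-1256 bytes

print(utf8_path)
print(legacy_path)

# unquote() also assumes UTF-8 by default, so only the first one survives.
print(unquote(utf8_path) == word)
print(unquote(legacy_path) == word)
```

The Windows-1256 form can still be decoded if the receiver is told the encoding explicitly, but nothing in the URL itself carries that information.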

The downside of this encoding is that if you are going to save Arabic content to a database (e.g. MySQL) using it (so the database will also be encoded with utf-8), each Arabic letter takes two bytes instead of the single byte it takes in windows-1256. Arabic text therefore ends up about double the size it would have been with windows-1256 (in which case the database would be encoded with latin-1).
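
The size difference is easy to measure on the encoded bytes. This sketch only compares raw byte lengths; actual database storage also depends on column types and overhead, which it does not model. The helper name `stored_sizes` and the sample strings are my own.

```python
def stored_sizes(text):
    """Return (Windows-1256 size, UTF-8 size) in bytes for the given text."""
    return len(text.encode("cp1256")), len(text.encode("utf-8"))

arabic = "سلام"
# Mixed content: ASCII digits, spaces and markup cost one byte in both encodings.
mixed = arabic + " 2024 <b>"

print("Arabic only:", stored_sizes(arabic))
print("Mixed:      ", stored_sizes(mixed))
```

Pure Arabic text doubles, while content mixed with digits, spaces, or markup grows by less than that.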

I suggest going with utf-8 if you can afford the size increase.