Firefox displays garbage characters in lieu of web page

On this web page http://taj.chass.ncsu.edu/Hindi.Less.05/dialog_script.html, Firefox and Opera, on both Windows and Linux, display the HTML source interspersed with garbage characters (for me, they appear as black diamonds with question marks) instead of a rendered web page.

Out of all the browsers I've tried, only Internet Explorer displays the page properly. I'd very much like to be able to use the web site with Firefox running on Linux. In order to try to get the page to display properly, I've attempted to manually set the character encoding to every value available, but have not had any success. Do you guys have any other suggestions?


Solution 1:

In Firefox, use

View->Character Encoding->More Encodings->UTF-16.

Hope that helps.

Most computer text is encoded as either ASCII or 8-bit Unicode (UTF-8).

For more info on UTF-16 specifically, check here.

In general, if you see the black diamond with a question mark (�) in Firefox, use some "intelligent guessing" and try changing the character encoding. Usually this works. Occasionally, though, particularly with Firefox on Linux, you may run into font issues.

Solution 2:

Though one can indeed manually choose an encoding (and must not forget to undo that when visiting another site), the web site itself should have specified it correctly. Either the server or the web pages themselves should declare an encoding, for otherwise all the browser can do is make a best guess. And of course, if an encoding is declared, then the HTML document should in fact use that encoding. That is not the case for the web site from the question, as shown below:

To see whether the web server specified an encoding, one needs to look at the HTTP response headers. Using the online service from web-sniffer.net to reveal the headers, you'll get:

HTTP/1.1 200 OK

Date:           Mon, 17 Aug 2009 17:47:03 GMT   
Server:         Apache  
Last-Modified:  Mon, 27 Nov 2006 23:38:49 GMT   
ETag:           "758b0606-1a316-4234309151440"  
Accept-Ranges:  bytes   
Content-Length: 107286  
Connection:     close   
Content-Type:   text/html; charset=utf-8 (BOM UTF-16, litte-endian)

The last line seems a bit odd: how can the server claim something to be both UTF-8 and UTF-16? The value for charset should be one of those registered with IANA (so, for example, UTF-8 without any comments). However, using the Wireshark packet sniffer rather than the online service reveals that the text (BOM UTF-16, litte-endian) is in fact a comment from the online service, not sent by the web server.

So: the web server claims it's going to send us a UTF-8 encoded HTML document.
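
As a minimal sketch of how the declared charset can be read out of such a header value, here is one way to do it with Python's standard library. The helper name `declared_charset` is made up for illustration, and the header string is simply copied from the output above.

```python
from email.message import Message

def declared_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lowercases the value and returns None when missing
    return msg.get_content_charset()

print(declared_charset("text/html; charset=utf-8"))  # utf-8
print(declared_charset("text/html"))                 # None
```

This only tells us what the server claims; it says nothing about the bytes that actually follow.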

However, the HTML document that follows is wrong (edited for readability):

ÿþ<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
  <head>
    <title>Lesson 5</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    <link href="main.css" rel="stylesheet" type="text/css">
  </head>
...

Above, the line specifying the content type should be the first to appear within the <head>, for otherwise the browser wouldn't know how to handle special characters in the <title>. More importantly, the first two odd characters, ÿþ, are in fact the bytes FF and FE (in hexadecimal), which, as the online service already noted, form the byte-order mark for UTF-16, little-endian.

So: the web server promised to send UTF-8 but then it sent markers that indicated UTF-16 LE. Next, in the HTML document, it claims to be using UTF-8 again.
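
The mismatch can be checked mechanically. The following is only a sketch, assuming we already have the raw response body as bytes; `sniff_bom` and the sample `body` are hypothetical names and data, not something taken from the site or from any browser's actual detection logic.

```python
import codecs

# Known byte-order marks. The UTF-32 LE mark (FF FE 00 00) begins with the
# UTF-16 LE mark (FF FE), so the UTF-32 entries must be tested first.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

# Made-up stand-in for the start of the response body: FF FE, then UTF-16 LE text
body = codecs.BOM_UTF16_LE + "<!DOCTYPE html>".encode("utf-16-le")
declared = "utf-8"  # what the Content-Type header claimed

found = sniff_bom(body)
if found is not None and found != declared:
    print(f"declared {declared}, but BOM says {found}")
```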

Indeed, Wireshark shows that the actual HTML document is UTF-16 encoded. This means that every character is sent using at least two bytes (octets). For example, the 6 characters in <html> are sent as the 12 bytes 3C 00 68 00 74 00 6D 00 6C 00 3E 00 (in hexadecimal). However, this particular web site could just as well have been plain ASCII, as its source doesn't seem to contain any non-ASCII characters at all. Instead, the HTML source is full of numeric character references (NCRs), such as:
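
The byte count for <html> is easy to reproduce. A small sketch in Python, where the variable names are arbitrary:

```python
text = "<html>"

utf16 = text.encode("utf-16-le")   # little-endian, without a BOM
print(len(text), len(utf16))       # 6 12
print(utf16.hex(" ").upper())      # 3C 00 68 00 74 00 6D 00 6C 00 3E 00

# The same tag in plain ASCII needs one byte per character
print(len(text.encode("ascii")))   # 6
```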

&#2351;&#2361; &#2342;&#2367;&#2354;&#2381;&#2354;&#2368;
&#2358;&#2361;&#2352; &#2361;&#2376;&#2404;

A browser displays the above as यह दिल्ली शहर है।. However, because of the combination of NCRs and UTF-16, the single character य (Unicode U+092F) requires as many as 14 bytes, 26 00 23 00 32 00 33 00 35 00 31 00 3B 00: it is written as the NCR &#2351;, and each of the 7 ASCII characters of that NCR is itself encoded in UTF-16. Without NCRs, this single य would require 3 bytes in UTF-8 (E0 A4 AF) and two bytes in UTF-16 (09 2F, or 2F 09 in little-endian byte order).
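
These byte counts can be verified with a short sketch. It uses `html.unescape` from Python's standard library to resolve the NCR; the variable names are arbitrary.

```python
import html

ncr = "&#2351;"
char = html.unescape(ncr)                 # resolves the NCR to the real character

print(char == "\u092f")                   # True
print(len(ncr.encode("utf-16-le")))       # 14  (NCR sent as UTF-16)
print(len(ncr.encode("ascii")))           # 7   (NCR sent as plain ASCII)

# The character itself, without an NCR
print(char.encode("utf-8").hex(" ").upper())       # E0 A4 AF
print(char.encode("utf-16-be").hex(" ").upper())   # 09 2F
print(char.encode("utf-16-le").hex(" ").upper())   # 2F 09
```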

For this HTML source, using UTF-16 is a total waste of bandwidth, and the server is not using any compression either.