What is the proper way to URL encode Unicode characters?

Solution 1:

I would always encode in UTF-8. From the Wikipedia page on percent encoding:

The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values. This requirement was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected.

It seems like because there were other accepted ways of doing URL encoding in the past, browsers attempt several methods of decoding a URI, but if you're the one doing the encoding you should use UTF-8.

Solution 2:

The general rule seems to be that browsers encode form responses according to the content-type of the page the form was served from. This is a guess that if the server sends us "text/xml; charset=iso-8859-1", then they expect responses back in the same format.

If you're just entering a URL in the URL bar, then the browser doesn't have a base page to work on and therefore just has to guess. So in this case it seems to be doing utf-8 all the time (since both your inputs produced three-octet form values).

The sad truth is that AFAIK there's no standard for what character set the values in a query string, or indeed any characters in the URL, should be interpreted as. At least in the case of values in the query string, there's no reason to suppose that they necessarily do correspond to characters.

It's a known problem that you have to tell your server framework which character set you expect the query string to be encoded as--- for instance, in Tomcat, you have to call request.setEncoding() (or some similar method) before you call any of the request.getParameter() methods. The dearth of documentation on this subject probably reflects the lack of awareness of the problem amongst many developers. (I regularly ask Java interviewees what the difference between a Reader and an InputStream is, and regularly get blank looks)