Unicode characters in URLs

In 2010, would you serve URLs containing UTF-8 characters in a large web portal?

Unicode characters are forbidden as per the RFC on URLs (see here). They would have to be percent encoded to be standards compliant.

My main point, though, is serving the unencoded characters for the sole purpose of having nice-looking URLs, so percent encoding is out.

All major browsers seem to be parsing those URLs okay no matter what the RFC says. My general impression, though, is that it gets very shaky when leaving the domain of web browsers:

  • URLs getting copy+pasted into text files, E-Mails, even Web sites with a different encoding
  • HTTP Client libraries
  • Exotic browsers, RSS readers

Is my impression correct that trouble is to be expected here, and thus it's not a practical solution (yet) if you're serving a non-technical audience and it's important that all your links work properly even if quoted and passed on?

Is there some magic way of serving nice-looking URLs in HTML

http://www.example.com/düsseldorf?neighbourhood=Lörick

that can be copy+pasted with the special characters intact, but work correctly when re-used in older clients?


Use percent encoding. Modern browsers will take care of display & paste issues and make it human-readable. E. g. http://ko.wikipedia.org/wiki/위키백과:대문

Edit: when you copy such an url in Firefox, the clipboard will hold the percent-encoded form (which is usually a good thing), but if you copy only a part of it, it will remain unencoded.


What Tgr said. Background:

http://www.example.com/düsseldorf?neighbourhood=Lörick

That's not a URI. But it is an IRI.

You can't include an IRI in an HTML4 document; the type of attributes like href is defined as URI and not IRI. Some browsers will handle an IRI here anyway, but it's not really a good idea.

To encode an IRI into a URI, take the path and query parts, UTF-8-encode them then percent-encode the non-ASCII bytes:

http://www.example.com/d%C3%BCsseldorf?neighbourhood=L%C3%B6rick

If there are non-ASCII characters in the hostname part of the IRI, eg. http://例え.テスト/, they have be encoded using Punycode instead.

Now you have a URI. It's an ugly URI. But most browsers will hide that for you: copy and paste it into the address bar or follow it in a link and you'll see it displayed with the original Unicode characters. Wikipedia have been using this for years, eg.:

http://en.wikipedia.org/wiki/ɸ

The one browser whose behaviour is unpredictable and doesn't always display the pretty IRI version is...

...well, you know.


Depending on your URL scheme, you can make the UTF-8 encoded part "not important". For example, if you look at Stack Overflow URLs, they're of the following form:

http://stackoverflow.com/questions/2742852/unicode-characters-in-urls

However, the server doesn't actually care if you get the part after the identifier wrong, so this also works:

http://stackoverflow.com/questions/2742852/これは、これを日本語のテキストです

So if you had a layout like this, then you could potentially use UTF-8 in the part after the identifier and it wouldn't really matter if it got garbled. Of course this probably only works in somewhat specialised circumstances...