What are the safe characters for making URLs?
I am making a website with articles, and I need the articles to have "friendly" URLs, based on the title.
For example, if the title of my article is "Article Test", I would like the URL to be http://www.example.com/articles/article_test.
However, article titles (like any string) can contain special characters that cannot be put literally in my URL. For instance, I know that ? or # need to be replaced, but I don't know all the others.
What characters are permissible in URLs? What is safe to keep?
To quote section 2.3 of RFC 3986:
Characters that are allowed in a URI, but do not have a reserved purpose, are called unreserved. These include uppercase and lowercase letters, decimal digits, hyphen, period, underscore, and tilde.
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
Note that RFC 3986 lists fewer unreserved punctuation marks than the older RFC 2396, which also treated ! * ' ( ) as unreserved; RFC 3986 moved those to the reserved set.
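As a rough illustration of what "unreserved" means in practice, here is a minimal sketch using Python's standard urllib.parse.quote. The helper name encode_component is made up for this example. Passing safe="" asks quote to percent-encode everything except letters, digits, and - . _ ~ (the tilde is left alone on Python 3.7 and later).

```python
from urllib.parse import quote


def encode_component(text):
    """Percent-encode everything except the RFC 3986 unreserved characters.

    encode_component is a hypothetical helper name; quote() is the real
    standard-library call doing the work.
    """
    # safe="" removes the default "/" exemption, so only unreserved
    # characters (letters, digits, "-", ".", "_", "~") pass through unchanged.
    return quote(text, safe="")


print(encode_component("Article Test?#"))  # space, "?" and "#" get encoded
print(encode_component("a-b.c_d~e"))       # unreserved characters are untouched
```

Non-ASCII text is encoded as UTF-8 bytes first, so a single accented letter becomes two percent-encoded bytes.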
There are two sets of characters you need to watch out for: reserved and unsafe.
The reserved characters are:
- ampersand ("&")
- dollar ("$")
- plus sign ("+")
- comma (",")
- forward slash ("/")
- colon (":")
- semi-colon (";")
- equals ("=")
- question mark ("?")
- 'At' symbol ("@")
- pound ("#").
The characters generally considered unsafe are:
- space (" ")
- less than and greater than ("<>")
- open and close brackets ("[]")
- open and close braces ("{}")
- pipe ("|")
- backslash ("\")
- caret ("^")
- percent ("%")
I may have forgotten one or more, which leads me to echo Carl V's answer. In the long run you are probably better off using a whitelist of allowed characters and then encoding the string, rather than trying to stay abreast of the characters that servers and systems disallow.
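If you want to audit existing titles against the two lists above, something like the following sketch might help. The names RESERVED, UNSAFE, and problem_characters are invented for this example, and the sets are simply the characters listed in this answer, so they inherit any omissions. A whitelist is still the safer approach for generating the URL itself.

```python
# The two character lists from this answer, written as sets.
RESERVED = set("&$+,/:;=?@#")
UNSAFE = set(" <>[]{}|\\^%")


def problem_characters(title):
    """Return the reserved or unsafe characters found in title.

    Each character is reported once, in order of first appearance.
    """
    found = []
    for ch in title:
        if (ch in RESERVED or ch in UNSAFE) and ch not in found:
            found.append(ch)
    return found


print(problem_characters("What is C#? 50% off"))
print(problem_characters("article_test"))
```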
Always Safe
In theory and by the specification, these are safe basically anywhere, except the domain name. Percent-encode anything not listed, and you're good to go.
A-Z a-z 0-9 - . _ ~ ( ) ' ! * : @ , ;
Sometimes Safe
Only safe when used within specific URL components; use with care.
Paths: + & =
Queries: ? /
Fragments: ? / # + & =
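To make the component-specific rules concrete, here is a small sketch built on Python's urllib.parse.quote, whose safe argument names the characters to leave unencoded. The constants and the two helper functions are hypothetical names for this example, and the safe sets are taken from the lists in this answer rather than from any library default.

```python
from urllib.parse import quote

# Punctuation from the "Always Safe" list. Letters, digits and "-._~"
# are never encoded by quote(), so they need not be listed here.
ALWAYS_SAFE_PUNCT = "()'!*:@,;"


def encode_path_segment(segment):
    """Encode one path segment, keeping + & = literal as the Paths line allows."""
    # "/" is deliberately not in the safe set, so a slash inside a title
    # cannot be mistaken for a path separator.
    return quote(segment, safe=ALWAYS_SAFE_PUNCT + "+&=")


def encode_query_value(value):
    """Encode one query value, keeping ? and / literal as the Queries line allows."""
    # "&" and "=" are encoded here because they act as delimiters in a query.
    return quote(value, safe=ALWAYS_SAFE_PUNCT + "?/")


print(encode_path_segment("Q&A: a+b=c (draft)/v2"))
print(encode_query_value("a/b?c=d&e"))
```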
Never Safe
According to the URI specification (RFC 3986), all other characters must be percent-encoded. This includes:
<space> <control-characters> <extended-ascii> <unicode>
% < > [ ] { } | \ ^
If maximum compatibility is a concern, limit the character set to A-Z a-z 0-9 - _ . (with periods only for filename extensions).
Keep Context in Mind
Even if valid per the specification, a URL can still be "unsafe", depending on context. Examples include a file:/// URL containing invalid filename characters, or a query component containing "?", "=", and "&" when they are not used as delimiters. Correct handling of these cases is generally up to your scripts and can be worked around, but it's something to keep in mind.
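For the query case, one way to avoid ambiguity is to let a library encode the values, so that any "?", "=", or "&" inside a value cannot be confused with a delimiter. The sketch below uses Python's standard urlencode and parse_qs; the parameter name title and its value are invented for the example.

```python
from urllib.parse import urlencode, parse_qs

# A value that contains characters which normally act as query delimiters.
params = {"title": "Q&A: is 1+1=2?"}

# urlencode() percent-encodes the value and writes spaces as "+".
query = urlencode(params)
print(query)

# Decoding restores the original value; parse_qs returns a list per key.
decoded = parse_qs(query)
print(decoded["title"][0])
```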
You are best off keeping only a known set of characters (a whitelist) instead of removing certain characters (a blacklist).
You can technically allow any character, just as long as you properly encode it. But, to answer in the spirit of the question, you should only allow these characters:
- Lower case letters (convert upper case to lower)
- Numbers, 0 through 9
- A dash - or underscore _
- Tilde ~
Everything else has a potentially special meaning. For example, you may think you can use +, but it can be decoded as a space. & is dangerous too, especially if you use rewrite rules.
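A minimal sketch of that whitelist, assuming Python and a made-up function name slugify: it lowercases the title and replaces every run of characters outside a-z, 0-9, dash, underscore, and tilde with a single underscore. The underscore separator is a choice made here to match the article_test example in the question; a dash would work just as well.

```python
import re


def slugify(title):
    """Turn an article title into a URL-friendly slug (illustrative only)."""
    slug = title.lower()
    # Replace each run of non-whitelisted characters with one underscore.
    slug = re.sub(r"[^a-z0-9_~-]+", "_", slug)
    # Trim underscores left over from leading or trailing punctuation.
    return slug.strip("_")


print(slugify("Article Test"))
print(slugify("What's new in C#?"))
print(slugify("  50% off + more!  "))
```

Note that this drops accented and other non-ASCII letters rather than transliterating them, and that two different titles can produce the same slug, so you may still need a uniqueness check on your side.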
As with the other comments, check out the standards and specifications for complete details.