What characters must be escaped in an HTTP query string?
This question concerns the characters in the query string portion of the URL, which appear after the ? character.
Per Wikipedia, certain characters are left as is and others are encoded (usually with a % escape sequence).
I've been trying to track this down to actual specifications, so that I understand the justification behind every bullet point in that Wikipedia page.
Contradiction Example 1:
The HTML specification says to encode space as + and defers the rest to RFC1738. However, this RFC says that ~ is unsafe and furthermore that "[a]ll unsafe characters must always be encoded within the URL". This seems to contradict Wikipedia.
In practice, IE8 encodes ~ in the query strings it generates, while FF3 leaves it as is.
Contradiction Example 2:
Wikipedia states that all characters that it does not mention must be encoded. The ! character is not mentioned in Wikipedia, but RFC1738 states that ! is a "special" character and "may be used unencoded". This seems to contradict Wikipedia, which says that it must be encoded.
In practice, IE8 encodes ! in the query strings it generates, while FF3 leaves it as is.
I understand that the moral of this is probably going to be to encode those characters that are in doubt between Wikipedia and the specifications. Perhaps even going as far as encoding everything that is not [A-Za-z0-9]. I would just like to know the actual standards on this.
Conclusions
The algorithm described on Wikipedia encodes precisely those characters which are not RFC3986 unreserved characters. That is, it encodes all characters other than alphanumerics and the four characters - . _ ~ (hyphen, period, underscore, tilde). As a special case, space is encoded as + instead of the %20 that RFC3986 percent-encoding would produce.
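As a rough illustration of the rule above, Python's standard urllib.parse module can be used. This is only a sketch: it assumes Python 3.7 or later (where ~ is treated as unreserved), and the sample string is made up for the demonstration.

```python
from urllib.parse import quote, quote_plus

# Hypothetical sample containing a space, an unreserved "~", and two
# characters ("!" and "*") that are not in the RFC 3986 unreserved set.
sample = "a b~c!d*e"

# With safe="", quote() leaves only alphanumerics and - . _ ~ unencoded.
strict = quote(sample, safe="")

# quote_plus() applies the same rule but writes space as "+".
form = quote_plus(sample)

print(strict)  # a%20b~c%21d%2Ae
print(form)    # a+b~c%21d%2Ae
```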
Some applications use an older RFC. For comparison, the RFC2396 unreserved characters are alphanumerics and ! ' ( ) * - . _ ~
For comparison, the HTML5 working draft algorithm encodes all characters other than alphanumerics and the four characters * - . _ (asterisk, hyphen, period, underscore). The special case encoding for space remains +. Notable differences are that * is not encoded and ~ is encoded. (Technically, this handling of * is compatible with RFC3986 even though * is in reserved, because it is in the sub-delims, which are allowed in the query production.)
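The HTML5 rule as summarized above differs from what urllib.parse does by default, so a hand-rolled sketch may make the difference easier to see. The function name html5_form_encode is hypothetical, and the choice of UTF-8 for non-ASCII input is an assumption, not something stated in the text above.

```python
import string

# Bytes left unencoded by the HTML5 draft rule as summarized above:
# alphanumerics plus * - . _
HTML5_SAFE = frozenset(
    (string.ascii_letters + string.digits + "*-._").encode("ascii")
)

def html5_form_encode(text: str) -> str:
    """Hypothetical helper: encode text following the summarized HTML5 rule."""
    out = []
    for byte in text.encode("utf-8"):  # assumption: UTF-8 for non-ASCII input
        if byte == 0x20:
            out.append("+")               # space is the special case
        elif byte in HTML5_SAFE:
            out.append(chr(byte))         # left as is
        else:
            out.append(f"%{byte:02X}")    # everything else, including "~"
    return "".join(out)

print(html5_form_encode("a b~c*d!"))  # a+b%7Ec*d%21
```

Note how ~ is encoded and * is not, which is the reverse of the RFC3986 unreserved rule.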
Solution 1:
The answer lies in the RFC 3986 document, specifically Section 3.4.
The query component is indicated by the first question mark ("?") character and terminated by a number sign ("#") character or by the end of the URI.
...
The characters slash ("/") and question mark ("?") may represent data within the query component.
Technically, RFC 3986-3.4 defines the query component as:
query = *( pchar / "/" / "?" )
This syntax means that a query can include all characters from pchar as well as / and ?. Here pchar is another production in the grammar, the one for path characters. Helpfully, Appendix A of RFC 3986 lists the relevant ABNF definitions, most notably:
query = *( pchar / "/" / "?" )
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded = "%" HEXDIG HEXDIG
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
Thus, in addition to all alphanumerics and percent encoded characters, a query can legally include the following unencoded characters:
/ ? : @ - . _ ~ ! $ & ' ( ) * + , ; =
Of course, you may want to keep in mind that '=' and '&' usually have special significance within a query.
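To make the grammar concrete, here is a minimal sketch that transcribes the query production into a regular expression and checks strings against it. The names QUERY_RE and is_valid_query are made up for this example. The last lines show the practical point about & and =: when they are data rather than delimiters, a form encoder such as urllib.parse.urlencode escapes them.

```python
import re
from urllib.parse import urlencode

# query = *( pchar / "/" / "?" ), written out as a character class plus
# the pct-encoded alternative ("%" HEXDIG HEXDIG).
QUERY_RE = re.compile(r"(?:[A-Za-z0-9\-._~!$&'()*+,;=:@/?]|%[0-9A-Fa-f]{2})*")

def is_valid_query(query: str) -> bool:
    """Hypothetical helper: True if query matches the RFC 3986 production."""
    return QUERY_RE.fullmatch(query) is not None

print(is_valid_query("a=b&c=/d?e~f"))  # True: every character is allowed
print(is_valid_query("x=%20"))         # True: a well-formed percent escape
print(is_valid_query("a b"))           # False: a raw space is not allowed
print(is_valid_query("100%"))          # False: "%" must start an escape

# "&" and "=" inside a value are escaped so they are not read as delimiters.
encoded = urlencode({"q": "a&b=c"})
print(encoded)  # q=a%26b%3Dc
```

This only checks syntax; it says nothing about how a particular server will interpret the characters it accepts.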