Valid characters for directory part of a URL (for short links)
Solution 1:
A path segment (one of the parts of a path separated by /) in an absolute URI path can contain zero or more pchar, which is defined as follows:
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
pct-encoded = "%" HEXDIG HEXDIG
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
So it's basically A–Z, a–z, 0–9, -, ., _, ~, !, $, &, ', (, ), *, +, ,, ;, =, :, @, as well as % that must be followed by two hexadecimal digits. Any other character or byte needs to be percent-encoded.
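As a rough illustration of the grammar above, here is a minimal sketch of a check for a single path segment. The function name `is_valid_segment` is made up for this example, and the regular expression is just a direct transcription of the pchar rule, not an official validator.

```python
import re

# pchar = unreserved / pct-encoded / sub-delims / ":" / "@"  (RFC 3986)
# A segment is zero or more pchar, so the empty string matches too.
_SEGMENT_RE = re.compile(
    r"(?:[A-Za-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2})*"
)


def is_valid_segment(segment: str) -> bool:
    """Return True if `segment` consists only of pchar (hypothetical helper)."""
    return _SEGMENT_RE.fullmatch(segment) is not None
```

For example, `is_valid_segment("a%20b")` is accepted, while `is_valid_segment("a b")` and `is_valid_segment("a/b")` are rejected, because a literal space and the slash are not pchar.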
Although these 79 characters can be used literally in a path segment, some user agents percent-encode some of them anyway (e.g. %7E instead of ~). That's why many use just the 62 alphanumeric characters (i.e. A–Z, a–z, 0–9) or the Base 64 Encoding with URL and Filename Safe Alphabet from RFC 4648 (i.e. A–Z, a–z, 0–9, -, _).
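For short links restricted to the 62 alphanumeric characters, one common approach is to encode a numeric ID in base 62. The sketch below assumes the short link is derived from a non-negative integer (such as a database row ID); the names `ALPHABET` and `encode_base62` and the digit ordering are choices made for this example only.

```python
import string

# 62 characters: 0-9, then A-Z, then a-z (ordering is an arbitrary choice here)
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase


def encode_base62(n: int) -> str:
    """Encode a non-negative integer using only alphanumeric characters."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    # Digits were produced least-significant first, so reverse them.
    return "".join(reversed(digits))
```

If the ID should not be guessable, the standard library's `secrets.token_urlsafe()` is an alternative; it returns random text in the URL-safe Base 64 alphabet.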
Solution 2:
According to RFC 3986, the valid characters for the path component are:
a-z A-Z 0-9 . - _ ~ ! $ & ' ( ) * + , ; = : @
as well as percent-encoded characters and, of course, the slash /.
Keep in mind, though, that many applications (not necessarily browsers) that parse URIs, for example to make them clickable, may support a much smaller set of characters. This is akin to parsing e-mail addresses, where most attempts also fail to accept every address allowed by the standard.
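If a path segment has to carry characters outside the set listed above, they need to be percent-encoded first. A minimal sketch with Python's standard `urllib.parse.quote` follows; the wrapper name `encode_segment` is made up for this example, and passing `safe=""` is an assumption chosen so that even the slash gets encoded inside a single segment.

```python
from urllib.parse import quote


def encode_segment(text: str) -> str:
    """Percent-encode `text` for use as one path segment (hypothetical helper).

    safe="" means "/" is encoded as well, so the result stays a single segment.
    Non-ASCII characters are encoded as their UTF-8 bytes.
    """
    return quote(text, safe="")
```

This is deliberately conservative: it also encodes the sub-delims and ":" and "@", which are allowed literally in a segment, but it sidesteps the problem of applications that only recognise a smaller character set.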