Syntax for Apache RewriteRule to match %-encoded URLs? (to fix character encoding issues; windows-1252 <=> utf-8)
I host a webpage that has 'project²' in the URL, matching an on-disk directory project² from which static files are served.
This page is used by a Java-based client to load data from URLs (bioinformatics software IGV).
My page lists URLs in the form of http://localhost:60151/load?file=http://example.org/project²/some/data/file.bam.
Clicking these links in the browser will cause the IGV client (running on localhost) to request GET http://example.org/project²/some/data/file.bam from my server.
✅ IGV on Linux/Mac responds by requesting this URL as UTF-8 encoded (² = %C2%B2), and everything is happily working.
❌ My newly gained Win-10 user's client requests ² = %B2 (windows-1252 encoded), resulting in a 404-not-found.
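To make the two encodings concrete, here is a minimal Python sketch of how the same path segment percent-encodes under UTF-8 versus windows-1252. Python is used only for illustration (neither IGV nor Apache is involved), and the helper name percent_encode is made up for this example.

```python
from urllib.parse import quote

def percent_encode(text: str, encoding: str) -> str:
    """Percent-encode text after converting it to bytes in the given encoding."""
    return quote(text, safe="/", encoding=encoding)

# Same characters, different bytes on the wire.
utf8_form = percent_encode("project²", "utf-8")     # project%C2%B2
cp1252_form = percent_encode("project²", "cp1252")  # project%B2

print(utf8_form)
print(cp1252_form)
```

The directory on disk is named with the UTF-8 bytes, so only the first form maps to an existing path.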
After trying dozens of things, I am at my wits' end as to how to help this user.
I have the impression that I should be able to dynamically rewrite the wrongly-encoded URLs on the server-side, so that they still end up serving the desired data, but I do not know the magic character combinations to make the rule-patterns match escaped characters.
Things I've already tried
- Double-checking that the 404s are not network issues; I see the GET %B2 in my ssl_access_log with 404 as the returned status code, so it really is the server doing it.
- 'Proper' way: URL-encoding the URL before giving it to the client. Perl's URI::Encode encode_uri turns the ² into %C3%82%C2%B2 (apparently Â²?), which is even more wrong somehow?
- Triple-checked that the webpage providing the load-URLs is served as UTF-8
  - it provides header Content-Type: text/html; charset=UTF-8
  - Set AddDefaultCharset UTF-8 in httpd.conf
  - It seems the encoding info is not transferred through from the web browser's link click into the Java program
- 'Doubled' the directory by symlinking projectª -> project² and project%B2 -> project² (edit: ª is in no way related; no idea where I got that from; ª is the UTF8-match for %B2)
- Tried to mod_rewrite 'bad' URLs into good ones in several different ways, none of which seem to catch:
RewriteEngine on
# RewriteRule Pattern Substitution [flags]
RewriteRule (.*)project%B2/(.*) $1project²/$2 [NE] # encoded 'bad' request, unencoded redirect
RewriteRule (.*)²(.*) $1%C2%B2$2 [B,NE] # config file is utf-8 encoded, so this is senseless.
RewriteRule (.*)%B2(.*) $12$2 [B,NE] # doesn't match?
RewriteRule (.*)TZZT(.*) $1test$2 # works, so RewriteEngine is working
The RewriteRule and RewriteRuleFlags docs also do not help me understand how I should encode the Pattern part so that it'll work :-(
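A side note on the encode_uri attempt in the list above: the %C3%82%C2%B2 result is what you get when the UTF-8 bytes of ² are treated as two single-byte characters and then encoded a second time. The sketch below reproduces this in Python. It rests on the assumption that this is what happened in the Perl script, which has not been verified.

```python
from urllib.parse import quote

utf8_bytes = "²".encode("utf-8")  # the two bytes 0xC2 0xB2

# Assumption: somewhere the two bytes were misread as two
# windows-1252 characters instead of one UTF-8 character.
misread = utf8_bytes.decode("cp1252")

# Encoding the misread text as UTF-8 again doubles the damage.
double_encoded = quote(misread, encoding="utf-8")

print(misread)
print(double_encoded)
```

If this is the cause, decoding the input string to characters before calling the encoder should yield %C2%B2 instead.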
Similar questions on here
- Can Apache .htaccess convert the percent-encoding in encoded URIs from Win-1252 to UTF-8? -> an external encoding program (rewritemap) seems overkill, since it is literally only one folder, project², so my scope is smaller.
- Rewriting ASCII-percent-encoded locations to their UTF-8 encoded equivalent: same problem in NGinX, points to the above Apache question.
You can't "convert encodings" as such using only mod_rewrite, however, you can search for that specific sequence of characters in the requested URL and "correct it".
http://localhost:60151/load?file=http://example.org/project²/some/data/file.bam
RewriteRule (.*)project%B2/(.*) $1project²/$2 [NE]
Note that project² appears as part of the query string in the example URL you posted; however, the RewriteRule pattern (which you are using above) matches against the %-decoded URL-path only (which excludes the query string). To match against the query string you need to use an additional RewriteCond directive and match against the QUERY_STRING (or THE_REQUEST) server variable instead.
Note that the QUERY_STRING and THE_REQUEST server variables are %-encoded (or rather, as sent from the client) - they have not been %-decoded.
Try the following instead:
RewriteCond %{QUERY_STRING} (.+)/project%B2/(.*)
RewriteRule ^(load)$ $1?%1/project%C2%B2/%2 [NE,L]
The backreferences %1 and %2 in the substitution string refer to the preceding CondPattern - the parts before and after the troublesome /project%B2/ part.
$1 is simply a backreference to the URL-path (to save repetition), which I assume is always load.
The NE flag prevents the % itself (when used as part of the URL-encoded characters) being URL encoded.
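To see which parts of the query string the two backreferences capture, the CondPattern can be tried out in isolation. This is a rough Python emulation of the regex only, not of mod_rewrite itself; the sample query string is taken from the example URL with the bad encoding substituted in.

```python
import re

# The query string as the client would send it (still %-encoded).
query_string = "file=http://example.org/project%B2/some/data/file.bam"

# Same pattern as the CondPattern in the RewriteCond above.
cond = re.search(r"(.+)/project%B2/(.*)", query_string)
before, after = cond.group(1), cond.group(2)  # %1 and %2 in mod_rewrite terms

# Equivalent of the substitution "$1?%1/project%C2%B2/%2" with $1 = "load".
rewritten = f"load?{before}/project%C2%B2/{after}"
print(rewritten)
```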
UPDATE: I'm afraid my original question was unclear about who GETs which URL, so the "query-string" part of your answer doesn't apply...
If you need to match the %-encoded URL-path then you should match against the THE_REQUEST server variable instead. THE_REQUEST contains the first line of the HTTP request header and is not %-decoded. It contains the full URL-path (and query string) as sent from the client (as well as the request method and protocol version). For example, in the case of the malformed request, a string of the form:
GET /project%B2/some/data/file.bam HTTP/1.1
Which you could match and correct as follows:
RewriteCond %{THE_REQUEST} ^[A-Z]{3,7}\s(/project)%B2([^\s]+)
RewriteRule ^/?project %1%C2%B2%2 [NE,L]
%1 and %2 are backreferences to the captured subpatterns in the preceding CondPattern.
The RewriteRule pattern, on the other hand, matches against a pre-processed %-decoded URL-path only (as mentioned above). So %B2 is whatever that decodes to: the single byte 0xB2, which is not a valid character on its own in a UTF-8 encoded config file. It therefore needs to be represented by a hex escape in the regex, i.e. \xb2 (this is PCRE syntax for a single byte).
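The THE_REQUEST condition can be checked against the sample request line with an ordinary regex engine. The following is a small Python sketch of the match and the corrected path; it mimics only the regex and the backreference substitution, and says nothing about how Apache processes the result afterwards.

```python
import re

# First line of the malformed request, exactly as the client sent it.
request_line = "GET /project%B2/some/data/file.bam HTTP/1.1"

# Same pattern as the CondPattern: method, whitespace, then the raw path.
cond = re.match(r"^[A-Z]{3,7}\s(/project)%B2([^\s]+)", request_line)

# Equivalent of the substitution "%1%C2%B2%2".
corrected_path = cond.group(1) + "%C2%B2" + cond.group(2)
print(corrected_path)
```

Because THE_REQUEST is not %-decoded, the literal text %B2 in the pattern matches here.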
Solution
RewriteRules must use \x instead of % in order to match %-encoded URLs! (PCRE syntax for byte sequences)
mod_rewrite config uses PCRE regex syntax and operates on decoded URLs, so typing a %-encoding in a RewriteRule pattern causes it to look for the literal % character, not an encoded value.
The correct escape sequence in RewriteRules is \x, so the URL-encoded value %B2 can be matched using \xb2 (or \xB2; the hex digits are case-insensitive).
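The difference between the two pattern spellings can be reproduced outside Apache. This Python sketch decodes the path first (as mod_rewrite does before applying a RewriteRule pattern) and then tries both patterns on the resulting bytes. It is an approximation: Python's re module stands in for PCRE, and the sample path is an assumption based on the URLs above.

```python
import re
from urllib.parse import unquote_to_bytes

# The URL-path after %-decoding: %B2 has become the raw byte 0xB2.
decoded_path = unquote_to_bytes("project%B2/some/data/file.bam")

percent_pattern = re.compile(rb"(.*)project%B2/(.*)")  # looks for a literal '%'
hex_pattern = re.compile(rb"(.*)project\xb2/(.*)")     # looks for the byte 0xB2

percent_hit = percent_pattern.match(decoded_path)  # no match
hex_hit = hex_pattern.match(decoded_path)          # match

# Rebuild the path with the UTF-8 bytes of '²' in place of 0xB2.
fixed = hex_pattern.sub(
    lambda m: m.group(1) + b"project" + "²".encode("utf-8") + b"/" + m.group(2),
    decoded_path,
)
print(fixed.decode("utf-8"))
```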
Note that RewriteRule is a hacky solution for character encoding issues that only works when exactly one specific wrongly-encoded character is in a specific, predictable place.
For a general solution for multiple wrongly-encoded characters in arbitrary places, please see Can Apache .htaccess convert the percent-encoding in encoded URIs from Win-1252 to UTF-8?, which suggests using RewriteMap coupled to an external program in a full-featured programming language.
The proper solution is still to prevent this at the source, using explicit %-encoding throughout the entire chain. This avoids OS-dependent encoding accidentally happening 'somewhere in the middle', outside of your control. (assuming no client along the path does double-encoding, which should be a punishable offense..)
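As an illustration of explicit %-encoding at the source, the page generator could emit links whose path is already UTF-8 percent-encoded, so that only ASCII ever reaches the client. This is a sketch in Python rather than the Perl used on the actual page; build_load_link is a made-up helper, and whether the Java client passes a pre-encoded path through unchanged is an assumption that would need testing.

```python
from urllib.parse import quote

def build_load_link(path: str) -> str:
    """Return a load link whose file path is UTF-8 percent-encoded."""
    # Encode from characters (not from already-encoded bytes) to avoid double encoding.
    encoded_path = quote(path, safe="/", encoding="utf-8")
    return f"http://localhost:60151/load?file=http://example.org/{encoded_path}"

link = build_load_link("project²/some/data/file.bam")
print(link)
```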
How I got here
Getting desperate, I upped the server-wide logging using LogLevel Warn rewrite:trace3 as suggested in the mod_rewrite docs. The docs warn that this (heavily) impacts server performance, but it was manageable because this is a low-traffic server, and there were no pre-existing rewrites.
The additional logging is emitted into (ssl_)error_log.
This gave me insight into how exactly the matching was attempted, and what the internal representations for rules and URIs are in mod_rewrite.
Excerpt from ssl_error_log (many columns omitted for brevity), with rule RewriteRule (.*)project%B2/(.*) $1project²/$2 [NE,L]
[rewrite:trace3] applying pattern '(.*)project%B2/(.*)' to uri 'project\xb2/'
[rewrite:trace1] pass through /var/www/html/example.org/project\xb2
Note that the request-uri from the client is written \xb2, but my pattern uses %B2.
Matching the rule-syntax to the uri-syntax, with rule RewriteRule (.*)project\xB2/(.*) $1project²/$2 [NE,L]
[rewrite:trace3] applying pattern '(.*)project\\xb2/(.*)' to uri 'project\xb2/'
[rewrite:trace2] rewrite 'project\xb2/' -> 'project%c2%b2/'
[rewrite:trace1] internal redirect with /auth-test/project\xc2\xb2/ [INTERNAL REDIRECT]
🎉 success! 🎉 As we can see, we are now matching!
Why no [R]/[R=302] flag?
As this is a character-encoding issue, I do not think doing an extra HTTP round-trip will add value; every link fed into the client will run into the same issue again, unless I fix the encoding issue before feeding it into the client-side Java program.
Don't forget RewriteBase
Please note that this shortened version omits setting the correct RewriteBase, which can screw up the rewritten path, depending on where in your conf it is written (e.g. <Directory> vs <Location>). Without RewriteBase I accidentally redirected to ❌ https://example.org/var/www/html/rewrite-testing/project² instead of ✅ https://example.org/rewrite-testing/project².