Syntax for Apache RewriteRule to match %-encoded URLs? (to fix character encoding issues; Windows-1252 <=> UTF-8)

I host a webpage that has 'project²' in the URL, matching an on-disk directory project² from where static files are hosted.

This page is used by a java-based client to load data from URLs (bioinformatics software IGV). My page lists URLs in the form of http://localhost:60151/load?file=http://example.org/project²/some/data/file.bam. Clicking these links in the browser will cause the IGV client (running on localhost) to request GET http://example.org/project²/some/data/file.bam from my server.

✅ IGV on Linux/Mac responds by requesting this URL as UTF-8 encoded ²=%C2%B2, and everything is happily working.
❌ My newly gained Win-10 user's client requests ²=%B2 (windows-1252 encoded), resulting in a 404-not-found.
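
To make the difference concrete, here is a minimal Python sketch (Python is only used for illustration; it is not part of the setup described here) that percent-encodes the same character under both charsets:

```python
from urllib.parse import quote

# The same character, percent-encoded under the two charsets involved.
utf8_form = quote("²", encoding="utf-8")     # the form the Linux/Mac clients send
cp1252_form = quote("²", encoding="cp1252")  # the form the Windows client sends

print(utf8_form)    # %C2%B2
print(cp1252_form)  # %B2
```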

After trying dozens of things, I am at my wits' end as to how to help this user.

I have the impression that I should be able to dynamically rewrite the wrongly-encoded URLs on the server-side, so that they still end up serving the desired data, but I do not know the magic character combinations to make the rule-patterns match escaped characters.


Things I've already tried

  • Double-checking that the 404s are not network issues; I see the GET %B2 in my ssl_access_log with 404 as the returned status code, so it really is the server doing it.
  • 'Proper' way: UrlEncoding the URL before giving it to the client. Perl's URI::Encode encode_uri turns the ² into %C3%82%C2%B2 (apparently Â²?), which is even more wrong somehow?
  • triple-checked that the webpage providing the load-URLs is served as utf-8
    • it provides header Content-Type: text/html; charset=UTF-8
    • Set AddDefaultCharset UTF-8 in httpd.conf
    • It seems the encoding info is not transferred through from webbrowser API-link-click into the Java program
  • 'doubled' the directory by symlinking projectª -> project² and project%B2 -> project² (ª is the UTF8-match for %B2) edit: ª is in no way related; no idea where I got that from
  • Tried to mod_rewrite 'bad' URLs into good ones in several different ways, none of which seem to catch:
RewriteEngine on
# RewriteRule Pattern Substitution [flags]
RewriteRule (.*)project%B2/(.*) $1project²/$2 [NE] # encoded 'bad' request, unencoded redirect
RewriteRule (.*)²(.*) $1%C2%B2$2 [B,NE]            # config file is utf-8 encoded, so this is senseless.      
RewriteRule (.*)%B2(.*) $12$2 [B,NE]               # doesn't match?        
RewriteRule (.*)TZZT(.*) $1test$2                  # works, so RewriteEngine is working
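
The %C3%82%C2%B2 result from the URI::Encode attempt above looks like double encoding. The following Python sketch reproduces that exact string under one assumed explanation: the UTF-8 bytes of ² get misread as single-byte characters and are then UTF-8 encoded a second time. Whether this is what actually happened inside the Perl script is an assumption, not something confirmed here.

```python
from urllib.parse import quote

# Assumed mechanism: the UTF-8 bytes of "²" are misread as single-byte
# characters (Latin-1 here), then encoded as UTF-8 a second time.
utf8_bytes = "²".encode("utf-8")         # b'\xc2\xb2'
misread = utf8_bytes.decode("latin-1")   # two characters: 'Â²'
double_encoded = quote(misread, encoding="utf-8")

print(double_encoded)  # %C3%82%C2%B2
```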

The RewriteRule and RewriteRuleFlags docs also do not help me understand how I should encode the Pattern-part so that it'll work :-(


Similar questions on here

  • Can Apache .htaccess convert the percent-encoding in encoded URIs from Win-1252 to UTF-8? -> an external encoding program rewritemap seems overkill, since it is literally only one folder project², so my scope is smaller.
  • Rewriting ASCII-percent-encoded locations to their UTF-8 encoded equivalent: same problem in nginx, points to the above Apache question.

You can't "convert encodings" as such using only mod_rewrite, however, you can search for that specific sequence of characters in the requested URL and "correct it".

http://localhost:60151/load?file=http://example.org/project²/some/data/file.bam
RewriteRule (.*)project%B2/(.*) $1project²/$2 [NE]

Note that project² appears as part of the query string in the example URL you posted. However, the RewriteRule pattern (which you are using above) matches against the %-decoded URL-path only (which excludes the query string). To match against the query string you need to use an additional RewriteCond directive and match against the QUERY_STRING (or THE_REQUEST) server variable instead.

Note that the QUERY_STRING and THE_REQUEST server variables are still %-encoded (or rather, exactly as sent from the client); they have not been %-decoded.

Try the following instead:

RewriteCond %{QUERY_STRING} (.+)/project%B2/(.*)
RewriteRule ^(load)$ $1?%1/project%C2%B2/%2 [NE,L]

The backreferences %1 and %2 in the substitution string refer to the preceding CondPattern - the parts before and after the troublesome /project%B2/ part.

$1 is simply a backreference to the URL-path (to save repetition), which I assume is always load.

The NE flag prevents the % itself (when used as part of the URL-encoded characters) being URL encoded.
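
As a rough illustration of what the condition and substitution above do to the raw query string, here is a Python sketch. The function name fix_query is made up for this example, and Python's re module only stands in for Apache's regex engine on this simple pattern.

```python
import re

def fix_query(query: str) -> str:
    """Mimic the RewriteCond/RewriteRule pair on a raw (still %-encoded) query string."""
    m = re.search(r"(.+)/project%B2/(.*)", query)
    if m is None:
        return query  # condition not met: leave the query string untouched
    # m.group(1) and m.group(2) play the role of %1 and %2
    return f"{m.group(1)}/project%C2%B2/{m.group(2)}"

raw = "file=http://example.org/project%B2/some/data/file.bam"
rewritten = "load?" + fix_query(raw)
print(rewritten)  # load?file=http://example.org/project%C2%B2/some/data/file.bam
```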

UPDATE: I'm afraid my original question was unclear about who GETs which URL, so the "query-string" part of your answer doesn't apply...

If you need to match the %-encoded URL-path then you should match against the THE_REQUEST server variable instead. THE_REQUEST contains the first line of the HTTP request header and is not %-decoded. It contains the full URL-path (and query string) as sent from the client (as well as the request method and protocol version). For example, in the case of the malformed request, a string of the form:

GET /project%B2/some/data/file.bam HTTP/1.1

Which you could match and correct as follows:

RewriteCond %{THE_REQUEST} ^[A-Z]{3,7}\s(/project)%B2([^\s]+)
RewriteRule ^/?project %1%C2%B2%2 [NE,L]

%1 and %2 are backreferences to the captured subpatterns in the preceding CondPattern.

The RewriteRule pattern, on the other hand, matches against a pre-processed, %-decoded URL-path only (as mentioned above). By then %B2 has already been decoded to the single byte 0xB2. On its own this byte is not valid UTF-8 and cannot be typed as a printable character, so it needs to be written as a hex escape in the regex, i.e. \xb2 (PCRE syntax for a single byte).
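
To sanity-check the THE_REQUEST condition outside of Apache, the same regex can be tried against a sample request line. This is only a Python sketch of the matching step, with the UTF-8 bytes %C2%B2 substituted between the two captured groups; it does not model how Apache processes the resulting path afterwards.

```python
import re

# Same pattern as the CondPattern, applied to the raw first request line.
PATTERN = re.compile(r"^[A-Z]{3,7}\s(/project)%B2([^\s]+)")

request_line = "GET /project%B2/some/data/file.bam HTTP/1.1"
m = PATTERN.search(request_line)

# Groups 1 and 2 correspond to the %1 and %2 backreferences.
corrected = f"{m.group(1)}%C2%B2{m.group(2)}" if m else None
print(corrected)  # /project%C2%B2/some/data/file.bam
```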


Solution

RewriteRules must use \x instead of % in order to match %-encoded URLs! (PCRE syntax for byte sequences)

mod_rewrite-config uses PCRE regex syntax, and operates on decoded URLs, so typing a %-encoding in a RewriteRule pattern causes it to look for the literal %-character, not an encoded value.
The correct escape-character in RewriteRules is \x, so the URLencoded value %B2 can be matched using \xb2 (or \xB2, it's case-insensitive).
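
The effect can be reproduced outside of Apache. The Python sketch below decodes the path first, as mod_rewrite does before applying a RewriteRule pattern, and then tries both spellings of the pattern. Python's re module is not PCRE, but it supports the same \xhh escape, so it serves as a stand-in here.

```python
import re
from urllib.parse import unquote_to_bytes

# %-decode the path, as happens before a RewriteRule pattern is applied.
decoded_path = unquote_to_bytes("project%B2/some/data/file.bam")
print(decoded_path)  # b'project\xb2/some/data/file.bam'

# "%B2" in a pattern looks for the three literal characters %, B, 2.
literal_percent = re.search(rb"project%B2/", decoded_path)
# "\xb2" in a pattern looks for the single byte 0xB2.
hex_escape = re.search(rb"project\xb2/", decoded_path)

print(literal_percent is None)  # True
print(hex_escape is not None)   # True
```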

Note that RewriteRule is a hacky solution for character encoding issues that only works when exactly one specific wrong-encoded character is in a specific, predictable place.

For multiple wrong-encoded characters in arbitrary places, please see Can Apache .htaccess convert the percent-encoding in encoded URIs from Win-1252 to UTF-8?, which suggests a general solution using RewriteMap coupled to an external program written in a full-featured programming language.
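
For a feel of what such an external program would have to do, here is a minimal Python sketch of the conversion step only. The function name to_utf8_percent is hypothetical, the wiring into RewriteMap is left out, and the "try UTF-8 first, fall back to Windows-1252" rule is a heuristic assumption: some Windows-1252 byte sequences are also valid UTF-8 and would be left as they are.

```python
from urllib.parse import quote, unquote_to_bytes

def to_utf8_percent(path: str) -> str:
    """Re-encode a %-encoded path as UTF-8, guessing Windows-1252 if it is not valid UTF-8."""
    raw = unquote_to_bytes(path)
    try:
        text = raw.decode("utf-8")  # already UTF-8: keep as is
    except UnicodeDecodeError:
        # Heuristic fallback; bytes undefined in cp1252 become U+FFFD.
        text = raw.decode("cp1252", errors="replace")
    return quote(text, safe="/", encoding="utf-8")

print(to_utf8_percent("/project%B2/some/data/file.bam"))     # /project%C2%B2/some/data/file.bam
print(to_utf8_percent("/project%C2%B2/some/data/file.bam"))  # /project%C2%B2/some/data/file.bam
```

Decoding and re-quoting the whole path also normalizes any other escapes in it, which may or may not be acceptable depending on the URLs involved.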

The proper solution is still to prevent this from the source, using explicit %-encoding throughout the entire chain. This avoids OS-dependent encoding accidentally happening 'somewhere in the middle', outside of your control. (assuming no client along the paths does double-encoding, which should be a punishable offense..)
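
A sketch of what explicit %-encoding at the source could look like, in Python for illustration (the page generator in this case is Perl, and the helper name igv_load_link is made up). It assumes the client accepts a pre-encoded URL in the file parameter and does not encode it again, which is not verified here.

```python
from urllib.parse import quote

def igv_load_link(data_url: str) -> str:
    """Build a load link whose target URL is already UTF-8 percent-encoded."""
    # Keep ':' and '/' readable; everything non-ASCII becomes %XX (UTF-8).
    encoded = quote(data_url, safe=":/", encoding="utf-8")
    return "http://localhost:60151/load?file=" + encoded

link = igv_load_link("http://example.org/project²/some/data/file.bam")
print(link)
# http://localhost:60151/load?file=http://example.org/project%C2%B2/some/data/file.bam
```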


How I got here

Getting desperate, I upped the server-wide logging using LogLevel Warn rewrite:trace3, as suggested in the mod_rewrite docs. The docs warn that this can (heavily) impact server performance, but it was manageable because this is a low-traffic server and there were no pre-existing rewrites.

The additional logging is emitted into (ssl_)error_log. This gave me insight into how exactly the matching was attempted, and what the internal representations for rules and URIs are in mod_rewrite.

Excerpt from ssl_error_log (many columns omitted for brevity), with rule RewriteRule (.*)project%B2/(.*) $1project²/$2 [NE,L]

[rewrite:trace3] applying pattern '(.*)project%B2/(.*)' to uri 'project\xb2/'
[rewrite:trace1] pass through /var/www/html/example.org/project\xb2

Note that the request-uri from client is written \xb2, but my pattern uses %B2.

Matching the rule-syntax to the uri-syntax, with rule RewriteRule (.*)project\xB2/(.*) $1project²/$2 [NE,L]

[rewrite:trace3] applying pattern '(.*)project\\xb2/(.*)' to uri 'project\xb2/'
[rewrite:trace2] rewrite 'project\xb2/' -> 'project%c2%b2/'
[rewrite:trace1] internal redirect with /auth-test/project\xc2\xb2/ [INTERNAL REDIRECT]

🎉 success! 🎉 As we can see, we are now matching!


Why no [R]/[R=302] flag?

As this is a character-encoding issue, I do not think an extra HTTP round-trip will add value; every link fed into the client will run into the same issue again, unless I fix the encoding issue before feeding it into the client-side Java program.


Don't forget RewriteBase

Please note that this shortened version omits setting the correct RewriteBase, which can screw up the rewritten path, depending on where in your conf it is written (e.g. <Directory> vs <Location>). Without RewriteBase I accidentally redirected to ❌https://example.org/var/www/html/rewrite-testing/project² instead of ✅https://example.org/rewrite-testing/project².