Apache rewrite causes server error 403 when genuine directory exists after extension removal rewrite

I've spent a couple of days now trying to create a specific rule set that will allow me to remove .html extensions from all the files in the directory and present neater URIs. I'm using a .htaccess file in the root directory of that website and the plan is to use this across a number of sites that will have the same issues.

I've been through many iterations of similar config but the closest I have found is actually stripped directly from a post on here (which I sadly wasn't able to comment on to find out more). So the below is what I currently have:

RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{DOCUMENT_ROOT}%{REQUEST_URI}.html -f
RewriteRule (.*) $1.html [L]

ErrorDocument 404 https://example.com/404

It's simple and works really well for the most part, but when a genuine directory exists then it seems to throw a 403 server error.

For example, if I visit example.com/directory_A, I get a 403 error. However, there is also a file in the root directory with the same name, so I expect it to present example.com/directory_A.html (but without the .html, of course). In the directory_A directory there is a file, file_B.html, and visiting example.com/directory_A/file_B presents the file_B.html content as expected.

I'm going round in circles with this - this is definitely the closest I have come to solving my problem but I just don't know enough to get me over this last hurdle so any help here would be greatly appreciated.


but when a genuine directory exists then it seems to throw a 403 server error.

The 403 is not caused by the rule you posted. The first condition explicitly excludes directories, so the rule is not even applied to that request.

The 403 is caused by mod_dir trying to serve a DirectoryIndex document (e.g. index.html) from the /directory_A/ subdirectory, where presumably no such document exists.

Specifically, when you request /directory_A (without a trailing slash) mod_dir will "fix" the URL by appending a trailing slash via a 301 (permanent) redirect. Then, on the redirected request, mod_dir tries to serve the directory index from that directory and triggers a 403 if it does not exist and directory listings are disabled (mod_autoindex).

To do as you require you need to prevent mod_dir from appending the trailing slash on physical directories with the DirectorySlash Off directive. Then, in order to serve /directory_A.html (instead of passing through the /directory_A request) you need to remove the first condition that excludes requests for directories.

For example:

# Ensure that directory listings are disabled
Options -Indexes

# Prevent mod_dir appending a slash to physical directories
DirectorySlash Off

# Rewrite request to append ".html" extension if it exists
RewriteCond %{DOCUMENT_ROOT}/$1.html -f
RewriteRule (.*) $1.html [L]

Note that directory listings must be disabled if you set DirectorySlash Off. Otherwise, mod_autoindex will generate a directory listing when a directory is requested without a trailing slash and the corresponding .html file does not exist. See the security warning in the Apache docs regarding the DirectorySlash directive.

In the RewriteCond directive I replaced REQUEST_URI with the backreference from the RewriteRule pattern, so that the same value is used in both the RewriteCond TestString and the RewriteRule substitution.

Note that requesting /directory_A/ (with a trailing slash) will still result in a 403 response. This is expected, unless you specifically want to handle this edge case and route such requests to /directory_A.html instead.

UPDATE: This is best achieved with an external redirect that simply removes the trailing slash from the URL when a corresponding .html file exists. The rewrite (above) then does its thing and appends the .html extension on the redirected request. This ensures you have a single canonical URL, avoiding a potential duplicate content issue (where /directory_A and /directory_A/ both return the same resource).

For example, add the following "redirect" rule immediately before the above "rewrite" rule:

# Remove trailing slash on URL-path when the corresponding ".html" file exists
RewriteCond %{DOCUMENT_ROOT}/$1.html -f
RewriteRule (.*)/$ /$1 [R=302,L]

This doesn't explicitly check for the directory, so it also applies to other "files". e.g. /directory_A/file_B/ will be redirected to /directory_A/file_B (trailing slash removed).
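Putting the pieces above together, the relevant part of the .htaccess file might look something like the following. This is only a sketch of the ordering: the RewriteEngine On line is an assumption (it is not shown in the question but must be present somewhere for the rules to run), and it also assumes the server's AllowOverride setting permits Options and DirectorySlash in .htaccess.

```apache
# Assumed: the rewrite engine is enabled once near the top of the file
RewriteEngine On

# Ensure that directory listings are disabled
Options -Indexes

# Prevent mod_dir appending a slash to physical directories
DirectorySlash Off

# 1. External redirect: remove trailing slash when the ".html" file exists
#    (302 while testing, see the note on switching to 301)
RewriteCond %{DOCUMENT_ROOT}/$1.html -f
RewriteRule (.*)/$ /$1 [R=302,L]

# 2. Internal rewrite: append ".html" extension if the file exists
RewriteCond %{DOCUMENT_ROOT}/$1.html -f
RewriteRule (.*) $1.html [L]
```

The redirect comes first so that a request for /directory_A/ is canonicalised to /directory_A before the internal rewrite gets a chance to run.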

Test first with a 302 (temporary) redirect and only change to a 301 (permanent) redirect when you are sure it's working as intended in order to avoid potential caching issues.

You will need to make sure the browser cache is cleared before testing, since the earlier 301 that mod_dir triggered to append the trailing slash on the directory will have been cached by the browser.

TBH, it is better to avoid such conflicts to begin with and not have files with the same basename as physical directories when implementing "extensionless" URLs.


Aside:

Optimisation

Your directive that appends the .html extension could be optimised, since it currently tests every request for the existence of a file with .html on the end (which is relatively expensive and probably unnecessary). e.g. request /images/myimage.jpg and your rule will check for the existence of /images/myimage.jpg.html on the filesystem. You could avoid these unnecessary checks by excluding requests that already include a file extension (assuming your URLs don't intentionally contain dots near the end of the URL-path that look like a file extension).

For example:

# Rewrite request to append ".html" extension if it exists
RewriteCond $1 !\.\w{2,4}$
RewriteCond %{DOCUMENT_ROOT}/$1.html -f
RewriteRule (.*) $1.html [L]

ErrorDocument

ErrorDocument 404 https://example.com/404

This directive is arguably incorrect.

  1. When you specify an absolute URL it triggers a 302 (temporary) redirect to the error document, rather than the internal subrequest it should be. Consequently the client does not see the 404 HTTP status unless you manually set it on the response served at the redirect target. Either way, the client sees a 302 first.

  2. You should specify the actual URL of the 404 error document here, not the "extensionless" version (which requires additional processing), as you appear to be doing. This is entirely internal to your server; the client does not see this URL.

For example:

ErrorDocument 404 /404.html

That said, it's often preferable to keep your error documents in a separate subdirectory that is trivial to exclude from other redirects/rewrites. e.g. /errordocs/404.html.
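As a rough sketch of that idea, assuming the error documents live in the /errordocs/ subdirectory from the example above and that the exclusion rule is placed before the other rewrite rules in the root .htaccess file:

```apache
# Serve the error document via an internal subrequest (local URL-path)
ErrorDocument 404 /errordocs/404.html

# Illustrative exclusion (an assumption, not from the question):
# stop rewrite processing for anything under /errordocs/
RewriteRule ^errordocs/ - [L]
```

The "-" substitution leaves the URL unchanged and the L flag stops the remaining rules in the file from being applied to that request.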