Remove Characters from URL with htaccess

Hopefully someone can see what I'm doing wrong, but here's the story...

My current site URLs are auto-generated by the ecommerce software from the product and category names, so if a product or category name includes a non-alphanumeric character, it is encoded in the URL, which is a pain. E.g.:

mysite.com/Shop/Furniture-Set-Large-Table%2C-4-Chairs.html

I am moving to a new ecommerce solution, which also autogenerates the URLs from the product name but is clever enough to remove all non-alphanumeric characters. It also converts them to lowercase; I have managed to find an htaccess solution for redirecting uppercase to lowercase. The new URLs also do not have the 'Shop' part, which I have also managed to solve via htaccess. E.g.:

mysite.com/furniture-set-large-table-4-chairs.html

To remove the 'Shop' part:

RedirectMatch 301 ^/Shop/(.*)$ http://www.mysite.com/$1

To replace uppercase with lowercase to prevent a 404 error:

RewriteCond %{REQUEST_URI} [A-Z]
RewriteCond %{REQUEST_FILENAME} !\.(?:png|gif|ico|swf|jpg|jpeg|js|css|php|pdf)$
RewriteRule (.*) ${lc:http://www.mysite.com/$1} [R=301,L]

These both work perfectly.

So I need an htaccess rule, or possibly several, to remove these encoded characters from the URL. I don't need to replace them, just remove them, because the software creates the URL as "Table%2C-4-Chairs", so only the %2C needs to be removed.

I need to remove certain character encodings from the URL, such as:

comma (%2C), apostrophe (%27), colon (%3A), etc.

Can anyone advise a suitable htaccess rule or rules for this?

Thanks in advance.


Solution 1:

The URI is url-decoded before it's sent through the rewrite engine, so you want to match the actual characters and not their encoded counterparts:

RewriteRule ^(.*),(.*)$ /$1$2 [L]
RewriteRule ^(.*):(.*)$ /$1$2 [L]
RewriteRule ^(.*)\'(.*)$ /$1$2 [L]
RewriteRule ^(.*)\"(.*)$ /$1$2 [L]
# etc...

RewriteCond %{ENV:REDIRECT_STATUS} 200
RewriteRule ^(.*)$ http://www.mysite.com/$1 [L,R=301]

The REDIRECT_STATUS check tells mod_rewrite that if any of the above rules were applied (which sets the internal redirect status to 200), we need to redirect. That rule isn't reached until the URI has cleared all of the special-character checks.

You'd want these rules all before any of the redirects so that the rules can loop and remove multiple instances of any of those characters. Then, once there are no more special characters, the rewrite engine can trickle down to where your redirects are.

I'd suggest that you remove the mod_alias RedirectMatch directive and replace it with a rewrite rule. Combining the two modules and having both of them affect a single URI can sometimes lead to unexpected results. So, before all of the above rules, you'd have:

RewriteRule ^Shop/(.*)$ /$1 [L]

This adds the removal of /Shop/ to the chain of special-character rules. Your last rule would then follow:

RewriteCond %{REQUEST_URI} [A-Z]
RewriteCond %{REQUEST_FILENAME} !\.(?:png|gif|ico|swf|jpg|jpeg|js|css|php|pdf)$
RewriteRule (.*) ${lc:http://www.mysite.com/$1} [R=301,L]
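
Putting the pieces together, the complete .htaccess might look something like the sketch below. This is only an assembly of the rules above in the order described, not a tested configuration. It assumes that RewriteEngine On is needed (it may already be in your file) and that the lc map used by your existing lowercase rule is declared elsewhere as RewriteMap lc int:tolower, since RewriteMap is not allowed in .htaccess.

```apache
# Assumed to be present already; the rewrite rules do nothing without it.
RewriteEngine On

# 1. Strip the /Shop/ prefix internally (replaces the RedirectMatch).
RewriteRule ^Shop/(.*)$ /$1 [L]

# 2. Remove one special character per pass; [L] ends the pass and the
#    rules run again on the rewritten URI until none of these match.
RewriteRule ^(.*),(.*)$ /$1$2 [L]
RewriteRule ^(.*):(.*)$ /$1$2 [L]
RewriteRule ^(.*)\'(.*)$ /$1$2 [L]
RewriteRule ^(.*)\"(.*)$ /$1$2 [L]

# 3. If any rule above rewrote the URI, send the browser to the cleaned URL.
RewriteCond %{ENV:REDIRECT_STATUS} 200
RewriteRule ^(.*)$ http://www.mysite.com/$1 [L,R=301]

# 4. Lowercase whatever is left. Assumes "RewriteMap lc int:tolower" is
#    declared in the server or virtual host config.
RewriteCond %{REQUEST_URI} [A-Z]
RewriteCond %{REQUEST_FILENAME} !\.(?:png|gif|ico|swf|jpg|jpeg|js|css|php|pdf)$
RewriteRule (.*) ${lc:http://www.mysite.com/$1} [R=301,L]
```

With these rules, a request for /Shop/Furniture-Set-Large-Table%2C-4-Chairs.html should be rewritten internally to /Furniture-Set-Large-Table-4-Chairs.html, redirected there with a 301, and then redirected once more by the lowercase rule to /furniture-set-large-table-4-chairs.html.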