Blocking by user-agent string in httpd.conf not effective
I'd like to block some spiders and bad bots by user-agent string for all of my virtual hosts via httpd.conf, but so far I've had no success. Below are the relevant contents of my httpd.conf file. Any ideas why this isn't working? env_module is loaded.
SetEnvIfNoCase User-Agent "^BaiDuSpider" UnwantedRobot
SetEnvIfNoCase User-Agent "^Yandex" UnwantedRobot
SetEnvIfNoCase User-Agent "^Exabot" UnwantedRobot
SetEnvIfNoCase User-Agent "^Cityreview" UnwantedRobot
SetEnvIfNoCase User-Agent "^Dotbot" UnwantedRobot
SetEnvIfNoCase User-Agent "^Sogou" UnwantedRobot
SetEnvIfNoCase User-Agent "^Sosospider" UnwantedRobot
SetEnvIfNoCase User-Agent "^Twiceler" UnwantedRobot
SetEnvIfNoCase User-Agent "^Java" UnwantedRobot
SetEnvIfNoCase User-Agent "^YandexBot" UnwantedRobot
SetEnvIfNoCase User-Agent "^bot*" UnwantedRobot
SetEnvIfNoCase User-Agent "^spider" UnwantedRobot
SetEnvIfNoCase User-Agent "^crawl" UnwantedRobot
SetEnvIfNoCase User-Agent "^NG\ 1.x (Exalead)" UnwantedRobot
SetEnvIfNoCase User-Agent "^MJ12bot" UnwantedRobot
<Directory "/var/www/">
Order Allow,Deny
Allow from all
Deny from env=UnwantedRobot
</Directory>
<Directory "/srv/www/">
Order Allow,Deny
Allow from all
Deny from env=UnwantedRobot
</Directory>
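One way to check whether the variables are actually being set is mod_log_config's conditional logging, which writes an entry only for requests where the variable exists (the log path here is just an example):
# Logs only requests that matched one of the SetEnvIfNoCase rules above
CustomLog /var/log/httpd/unwanted_robots.log combined env=UnwantedRobot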
EDIT - @Shane Madden: I do have .htaccess files in each virtual host's document root with the following:
order allow,deny
deny from xxx.xxx.xxx.xxx
deny from xx.xxx.xx.xx
deny from xx.xxx.xx.xxx
...
allow from all
Could that be creating a conflict? (If so, I'd guess carrying the env-based deny into each .htaccess would be one workaround; see the sketch after the vhost config below.) Sample VirtualHost config:
<VirtualHost xx.xxx.xx.xxx:80>
ServerAdmin [email protected]
ServerName domain.com
ServerAlias www.domain.com
DocumentRoot /srv/www/domain.com/public_html/
ErrorLog "|/usr/bin/cronolog /srv/www/domain.com/logs/error_log_%Y-%m"
CustomLog "|/usr/bin/cronolog /srv/www/domain.com/logs/access_log_%Y-%m" combined
</VirtualHost>
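Here's the workaround I mean: if the .htaccess Order/Allow/Deny lines replace the ones inherited from the <Directory> blocks (per-directory config merges later, so it wins), the env-based deny would be lost. Repeating it in each .htaccess should restore it:
order allow,deny
deny from xxx.xxx.xxx.xxx
deny from env=UnwantedRobot
allow from all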
Try this, and if it fails, try it in a .htaccess file (mod_rewrite must be loaded for either to work)...
# Bad bot removal
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} ^useragent1 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^useragent2 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^useragent3 [NC]
RewriteRule ^(.*)$ http://website-you-want-to-send-bad-bots-to.com [R,L]
Follow this pattern, and don't put an [OR] on the very last condition; [NC] makes each match case-insensitive, and [R,L] issues the redirect and stops further rewriting.
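As a concrete illustration, here is the same pattern filled in with a few of the user-agents from the question (the redirect target is a placeholder). Note that I've dropped the ^ anchor: many of these crawlers identify themselves mid-string, e.g. "Mozilla/5.0 (compatible; YandexBot/3.0; ...)", so an anchored pattern would never match them.
# Bad bot removal - concrete sketch
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} Yandex [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Sogou [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MJ12bot [NC]
RewriteRule ^(.*)$ http://website-you-want-to-send-bad-bots-to.com [R,L]
# Alternative: refuse them outright with a 403 instead of redirecting:
# RewriteRule ^ - [F]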
EDIT: New solution:
If you want to block all well-behaved ("friendly") bots, create a file called "robots.txt" in the same directory as your index.html and put this in it:
User-agent: *
Disallow: /
You'd still need to maintain a list like the one in my original answer (above) to block the bots that ignore robots.txt.
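Conversely, if you only want to turn away specific well-behaved crawlers rather than all of them, robots.txt supports per-agent records; an empty Disallow in the catch-all group means everyone else may fetch everything:
User-agent: BaiDuSpider
Disallow: /

User-agent: Sosospider
Disallow: /

User-agent: *
Disallow: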