How to prevent bots from trying to guess a link on my site
Logwatch, which I installed recently, shows me this in its report:
--------------------- httpd Begin ------------------------
0.78 MB transferred in 5864 responses (1xx 0, 2xx 4900, 3xx 0, 4xx 964, 5xx 0)
160 Images (0.16 MB),
857 Content pages (0.62 MB),
4847 Other (0.00 MB)
Requests with error response codes
404 Not Found
/%E2%80%98planeat%E2%80%99-film-explores-l ... greenfudge-org/: 1 Time(s)
/10-foods-to-add-to-the-brain-diet-to-help ... -function/feed/: 1 Time(s)
/10-ways-to-reboot-your-body-with-healthy- ... s-and-exercise/: 1 Time(s)
/bachmann-holds-her-ground-against-raising ... com-blogs/feed/: 1 Time(s)
/behind-conan-the-barbarians-diet/: 1 Time(s)
/tag/dietitian/: 1 Time(s)
/tag/diets/page/10/: 1 Time(s)
/tag/directory-products/feed/: 1 Time(s)
/wp-content/uploads/2011/06/1309268736-49.jpg: 1 Time(s)
/wp-content/uploads/2011/06/1309271430-30.jpg: 1 Time(s)
/wp-content/uploads/2011/06/1309339847-35.jpg: 1 Time(s)
My note here: there are a lot of requests like the ones above; I pasted only a few for the sake of clarity.
A total of 12 ROBOTS were logged
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) 2 Time(s)
Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots) 5 Time(s)
Twitterbot/1.0 1 Time(s)
Mozilla/5.0 (compatible; AhrefsBot/2.0; +http://ahrefs.com/robot/) 4 Time(s)
Sosospider+(+http://help.soso.com/webspider.htm) 3 Time(s)
msnbot/2.0b (+http://search.msn.com/msnbot.htm)._ 1 Time(s)
Mozilla/5.0 (compatible; MJ12bot/v1.4.2; http://www.majestic12.co.uk/bot.php?+) 1 Time(s)
msnbot-media/1.1 (+http://search.msn.com/msnbot.htm) 77 Time(s)
Mozilla/5.0 (compatible; Ezooms/1.0; [email protected]) 1 Time(s)
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) 17 Time(s)
Baiduspider+(+http://www.baidu.com/search/spider.htm) 11 Time(s)
Mozilla/5.0 (compatible; Butterfly/1.0; +http://labs.topsy.com/butterfly/) Gecko/2009032608 Firefox/3.0.8 1 Time(s)
---------------------- httpd End -------------------------
So, I'm thinking this is some kind of bot (potentially one of those listed above). Can you please tell me how I could prevent it from guessing links in the hope of finding content?
Edit: since this is a VPS, there are a lot of domains on it. Can you tell me how I can find out which domain a particular 404 happened on? Take this line, for example: /tag/dietitian/
Solution 1:
You don't, really, any more than you can stop regular users from guessing at links. Secure your content correctly and this won't be an issue anyway.
Obscure links are not a safe way to hide things.
You can make sure you have a correctly configured robots.txt; that will stop most of the legitimate bots.
Solution 2:
One way would be to use fail2ban and configure it to match your needs. In short: among its other features, fail2ban can tail your Apache access log and, once a client has produced a configurable number of matches for a given pattern, penalize that client by blocking its IP for a set number of minutes.
This is usually enough to scare the bots away, but beware: if you are not careful, it can just as easily block legitimate users.
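To make the idea concrete, here is a minimal Python sketch of the kind of check fail2ban performs. It is not fail2ban itself and not its configuration. It assumes the access log is in Apache's common or combined format, and the threshold and window values are made up for illustration (fail2ban calls the corresponding settings maxretry and findtime).

```python
import re
from collections import defaultdict, deque
from datetime import datetime, timedelta

# Matches the start of an Apache common/combined log line:
# client IP, timestamp, request line, status code.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<req>[^"]*)" (?P<status>\d{3}) '
)

MAX_HITS = 5                       # hypothetical threshold, pick your own
FIND_TIME = timedelta(minutes=10)  # hypothetical window, pick your own


def find_offenders(lines, max_hits=MAX_HITS, find_time=FIND_TIME):
    """Return the IPs that produced at least max_hits 404s within find_time."""
    recent = defaultdict(deque)  # ip -> timestamps of its recent 404s
    offenders = set()
    for line in lines:
        m = LOG_LINE.match(line)
        if not m or m.group("status") != "404":
            continue
        ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        hits = recent[m.group("ip")]
        hits.append(ts)
        # Forget 404s that have fallen out of the time window.
        while ts - hits[0] > find_time:
            hits.popleft()
        if len(hits) >= max_hits:
            offenders.add(m.group("ip"))
    return offenders
```

Running this over a day's log before enabling any ban shows which IPs a given threshold would have caught, which helps to avoid the false positives mentioned above.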
Solution 3:
Search engine crawlers don't guess links - they just follow them unless dissuaded by a nofollow or robots.txt rule.
If a search engine's bot requests things that don't exist, the crawler is following a link on a publicly accessible page that points at them. The correct action is to correct or remove the reference.
If it's a malicious bot, all you can do is detect it and block access. If the bot announces itself in its User-Agent string, that's easy: you could, for example, block it with a rewrite rule.
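For the detection half, a rough Python sketch like the one below can tally 404 responses per User-Agent, so you can see which self-announcing bot (if any) is behind them. It assumes the log uses Apache's combined format, where the User-Agent is the last quoted field. The function name is made up for this example.

```python
import re
from collections import Counter

# Apache "combined" format:
# ip ident user [time] "request" status bytes "referer" "user-agent"
COMBINED = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def not_found_by_agent(lines):
    """Count 404 responses per User-Agent string."""
    counts = Counter()
    for line in lines:
        m = COMBINED.match(line)
        if m and m.group("status") == "404":
            counts[m.group("agent")] += 1
    return counts
```

Calling `not_found_by_agent(open(path)).most_common(10)` on an access log lists the agents responsible for the most 404s. Whether any of them deserves blocking is still a judgment call, since a legitimate crawler following stale links will show up here too.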