Blocking yandex.ru bot

Solution 1:

Don't believe what you read on forums about this! Trust what your server logs tell you. If Yandex obeyed robots.txt, you would see the evidence in your logs. I have seen for myself that Yandex robots do not even READ the robots.txt file!

Quit wasting time with long IP lists that only serve to slow down your site drastically.

Enter the following lines in .htaccess (in the root folder of each of your sites):

# Tag any request whose User-Agent starts with "Yandex", then deny it
# (drop the ^ anchor if you would rather match "Yandex" anywhere in the string)
SetEnvIfNoCase User-Agent "^Yandex" bad_bot
Order Deny,Allow
Deny from env=bad_bot

I did, and all Yandex gets now are 403 Access denied errors.
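
If you want to see the effect in your own logs, here is a minimal sketch that tallies the status codes handed to anything calling itself Yandex; the log path and the combined log format are assumptions, so adjust for your host:

import re
from collections import Counter

LOG = "/var/log/apache2/access.log"   # assumed path; point it at your own access log
# combined log format: ... "request" status bytes "referer" "user-agent"
LINE = re.compile(r'" (\d{3}) \S+ "[^"]*" "([^"]*)"\s*$')

codes = Counter()
with open(LOG) as fh:
    for line in fh:
        m = LINE.search(line)
        if m and "yandex" in m.group(2).lower():
            codes[m.group(1)] += 1

for status, count in codes.most_common():
    print(status, count)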

Good bye Yandex!

Solution 2:

I don't have enough reputation here yet to post all the URLs I need as hyperlinks, so pardon my parenthesized URLs, please.

The forum link from Dan Andreatta, and this other one, have some but not all of what you need. You'll want to use their method of finding the IP numbers, and script something to keep your lists fresh. Then you want something like this to show you some known values, including the sub-domain naming schemes they have been using. Keep a crontabbed eye on their IP ranges, and maybe automate something to estimate a reasonable CIDR (I didn't find any mention of their actual allocation; could just be google fail @ me).
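
For the "script something to keep your lists fresh" part, here is a rough, cron-able sketch of what I have in mind; the input file name and the PTR suffixes it trusts are my assumptions, and the CIDR "estimate" is nothing smarter than min-to-max of the addresses you've actually confirmed:

import ipaddress
import socket

SUFFIXES = (".yandex.ru", ".yandex.net", ".yandex.com")   # PTR suffixes to trust

confirmed = []
with open("yandex_ips.txt") as fh:          # hypothetical input: one IPv4 per line, pulled from your logs
    for line in fh:
        ip = line.strip()
        if not ip:
            continue
        try:
            host = socket.gethostbyaddr(ip)[0]
        except OSError:
            continue                        # no reverse DNS, skip it
        if host.endswith(SUFFIXES):
            confirmed.append(ipaddress.ip_address(ip))

if confirmed:
    low, high = min(confirmed), max(confirmed)
    for net in ipaddress.summarize_address_range(low, high):
        print(net)                          # rough covering CIDR(s) to feed your block list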

Find their IP range(s) as accurately as possible, so you don't have to waste time doing a reverse DNS lookup while users are waiting for (http://yourdomain/notpornipromise), and instead you're only doing a comparison match or something. Google just showed me grepcidr, which looks highly relevant. From the linked page: "grepcidr can be used to filter a list of IP addresses against one or more Classless Inter-Domain Routing (CIDR) specifications, or arbitrary networks specified by an address range." I guess it is nice that it's purpose-built code with known I/O, but you know you can reproduce the function in a billion different ways.
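
For instance, reproducing grepcidr's basic job takes only a few lines of Python; here is a sketch, with made-up input file names, that filters a list of IPs against one or more CIDR specs:

import ipaddress

# one CIDR per line -- whatever ranges you've settled on for Yandex
nets = [ipaddress.ip_network(s.strip()) for s in open("yandex_cidrs.txt") if s.strip()]

with open("candidate_ips.txt") as fh:       # hypothetical: one IP per line
    for line in fh:
        ip = line.strip()
        if ip and any(ipaddress.ip_address(ip) in net for net in nets):
            print(ip)                       # falls inside one of the ranges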

The most, "general solution", I can think of for this and actually wish to share (speaking things into existence and all that) is for you to start writing a database of such offenders at your location(s), and spend some off-hours thinking and researching on ways to defend and counter attack the behavior. This takes you deeper into intrusion detection, pattern analysis, and honey nets, than the scope of this specific question truly warrants. However, within the scope of that research are countless answers to this question you have asked.

I found this question due to Yandex's interesting behavior on one of my own sites. I wouldn't call what I see in my own log abusive, but spider50.yandex.ru consumed 2% of my visit count and 1% of my bandwidth... I can see where the bot would be truly abusive to large files and forums and such, neither of which are available for abuse on the server I'm looking at today. What was interesting enough to warrant investigation was the bot looking at /robots.txt, then waiting 4 to 9 hours and asking for a /directory/ not in it, then waiting 4 to 9 hours, asking for /another_directory/, then maybe a few more, and /robots.txt again, repeating ad infinitum. As far as frequency goes, I suppose they're well behaved enough, and the spider50.yandex.ru machine appeared to respect /robots.txt.

I'm not planning to block them from this server today, but I would if I shared Ross' experience.

For reference on the tiny numbers we're dealing with in my server's case, today:

Top 10 of 1315 Total Sites By KBytes
 #   Hits     %  Files     %  KBytes     %  Visits     %  Hostname
 1    247 1.20%    247 1.26%    1990 1.64%       4 0.19%  ip98-169-142-12.dc.dc.cox.net
 2    141 0.69%    140 0.72%    1873 1.54%       1 0.05%  178.160.129.173
 3    142 0.69%    140 0.72%    1352 1.11%       1 0.05%  162.136.192.1
 4     85 0.41%     59 0.30%    1145 0.94%      46 2.19%  spider50.yandex.ru
 5    231 1.12%    192 0.98%    1105 0.91%       4 0.19%  cpe-69-135-214-191.woh.res.rr.com
 6     16 0.08%     16 0.08%    1066 0.88%      11 0.52%  rate-limited-proxy-72-14-199-198.google.com
 7     63 0.31%     50 0.26%    1017 0.84%      25 1.19%  b3090791.crawl.yahoo.net
 8    144 0.70%    143 0.73%     941 0.77%       1 0.05%  user10.hcc-care.com
 9     70 0.34%     70 0.36%     938 0.77%       1 0.05%  cpe-075-177-135-148.nc.res.rr.com
10    205 1.00%    203 1.04%     920 0.76%       3 0.14%  92.red-83-54-7.dynamicip.rima-tde.net

That's on a shared host that doesn't even bother capping bandwidth anymore, and if the crawl took some DDoS-like form, they would probably notice and block it before I would. So, I'm not angry about that. In fact, I much prefer having the data they write in my logs to play with.

Ross, if you really are angry about the 2GB/day you're losing to Yandex, you might spampoison them. That's what it's there for! Reroute them from what you don't want them downloading, either by HTTP 301 directly to a spampoison sub-domain, or roll your own so you can control the logic and have more fun with it. That sort of solution gives you the tool to reuse later, when it's even more necessary.
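
"Roll your own" can be as small as a WSGI wrapper; here is a sketch, where the honeypot URL is obviously a placeholder for whatever spampoison-style target you choose:

HONEYPOT = "http://honeypot.example.com/"   # placeholder target, not a real service

def poison(app):
    """Wrap a WSGI app; 301 anything announcing itself as Yandex off to the honeypot."""
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        if "yandex" in ua:
            start_response("301 Moved Permanently",
                           [("Location", HONEYPOT), ("Content-Length", "0")])
            return [b""]
        return app(environ, start_response)
    return middleware

The same idea can live in an Apache rewrite rule instead, if you'd rather not touch application code.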

Then start looking deeper in your logs for funny ones like this:

217.41.13.233 - - [31/Mar/2010:23:33:52 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:33:54 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:33:58 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:00 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:01 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:03 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:04 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:05 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:06 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:09 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:14 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:16 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:17 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:18 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:21 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:23 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:24 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:26 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:27 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:28 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"

Hint: No /user/ directory, nor a hyperlink to such, exists on the server.
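
If you'd rather have the logs surface that kind of thing for you, here is a sketch that counts repeated 404s per (IP, path) pair; the log path, the regex, and the threshold of 10 are all assumptions to adjust:

import re
from collections import Counter

LOG = "/var/log/apache2/access.log"         # assumed path
LINE = re.compile(r'^(\S+) .* "(?:GET|POST|HEAD) (\S+) [^"]*" (\d{3}) ')

repeats = Counter()
with open(LOG) as fh:
    for line in fh:
        m = LINE.match(line)
        if m and m.group(3) == "404":
            repeats[(m.group(1), m.group(2))] += 1   # same client, same missing path

for (ip, path), hits in repeats.most_common():
    if hits >= 10:
        print(f"{ip} asked for {path} {hits} times, 404 every time")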

Solution 3:

According to this forum, the Yandex bot is well behaved and respects robots.txt.

In particular, they say:

The behaviour of Yandex is quite a lot like that of Google with regard to robots.txt .. The bot doesn't look at the robots.txt every single time it enters the domain.

Bots like Yandex, Baidu, and Sohu have all been fairly well behaved and, as a result, are allowed. None of them have ever gone places I didn't want them to go, and crawl rates don't break the bank with regard to bandwidth.

Personally, I do not have issues with it, and Googlebot is by far the most aggressive crawler for the sites I have.