nginx, separate robot access log and human access log

I'm trying to separate the robot access log from the human access log, so I'm using the configuration below:

    http {
        ....
        map $http_user_agent $ifbot {
            default                  0;
            "~*rogerbot"             3;
            "~*ChinasoSpider"        3;
            "~*Yahoo"                1;
            "~*Bot"                  1;
            "~*Spider"               1;
            "~*archive"              1;
            "~*search"               1;
            "~*Yahoo"                1;
            "~Mediapartners-Google"  1;
            "~*bingbot"              1;
            "~*YandexBot"            1;
            "~*Feedly"               2;
            "~*Superfeedr"           2;
            "~*QuiteRSS"             2;
            "~*g2reader"             2;
            "~*Digg"                 2;
            "~*trendiction"          3;
            "~*AhrefsBot"            3;
            "~*curl"                 3;
            "~*Ruby"                 3;
            "~*Player"               3;
            "~*Go\ http\ package"    3;
            "~*Lynx"                 3;
            "~*Sleuth"               3;
            "~*Python"               3;
            "~*Wget"                 3;
            "~*perl"                 3;
            "~*httrack"              3;
            "~*JikeSpider"           3;
            "~*PHP"                  3;
            "~*WebIndex"             3;
            "~*magpie-crawler"       3;
            "~*JUC"                  3;
            "~*Scrapy"               3;
            "~*libfetch"             3;
            "~*WinHTTrack"           3;
            "~*htmlparser"           3;
            "~*urllib"               3;
            "~*Zeus"                 3;
            "~*scan"                 3;
            "~*Indy\ Library"        3;
            "~*libwww-perl"          3;
            "~*GetRight"             3;
            "~*GetWeb!"              3;
            "~*Go!Zilla"             3;
            "~*Go-Ahead-Got-It"      3;
            "~*Download\ Demon"      3;
            "~*TurnitinBot"          3;
            "~*WebscanSpider"        3;
            "~*WebBench"             3;
            "~*YisouSpider"          3;
            "~*check_http"           3;
            "~*webmeup-crawler"      3;
            "~*omgili"               3;
            "~*blah"                 3;
            "~*fountainfo"           3;
            "~*MicroMessenger"       3;
            "~*QQDownload"           3;
            "~*shoulu.jike.com"      3;
            "~*omgilibot"            3;
            "~*pyspider"             3;
        }
        ....
    }

And in the server part, I'm using:

    if ($ifbot = "1") {
        set $spiderbot 1;
    }
    if ($ifbot = "2") {
        set $rssbot 1;
    }
    if ($ifbot = "3") {
        return 403;
        access_log /web/log/badbot.log  main;
    }

    access_log /web/log/location_access.log  main;
    access_log /web/log/spider_access.log main if=$spiderbot;
    access_log /web/log/rssbot_access.log main if=$rssbot;

But it seems that nginx writes some robot requests to both location_access.log and spider_access.log.

How can I separate the logs for the robots?

Another question: some robot requests are not written to spider_access.log but do appear in location_access.log. It seems that my map is not working. Is anything wrong with how I define the map?


Solution 1:

Working solution, without any other process involved:

Inspired by the comments. You can easily adapt it to several kinds of bots (bad or good ones) and put the return 403; statement in the right place. The idea is as follows:

In the http part:

    map $http_user_agent $bot {
        default "";
        "~*Googlebot"   "yes";
        "~*MJ12bot"     "yes";
        # Add as many as desired
    }
    map $bot $no_bot {
        default "no";
        "yes"   "";
    }
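A note on extending the list, since the map in the question mixes generic and specific patterns: nginx tries the regular expressions of a map in the order they appear in the configuration and uses the first one that matches. A generic entry such as "~*Bot" placed early therefore wins over a more specific entry such as "~*AhrefsBot" further down, and the request ends up in the wrong class. A minimal sketch with the specific entries first, reusing a few patterns and the numeric classes from the question ($ifbot is the question's variable, not part of this answer):

```nginx
map $http_user_agent $ifbot {
    default          0;
    # Specific patterns first: the first matching regex wins.
    "~*AhrefsBot"    3;
    "~*JikeSpider"   3;
    "~*YisouSpider"  3;
    # Generic catch-alls last.
    "~*Bot"          1;
    "~*Spider"       1;
}
```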

Then, in the server part:

    access_log   /var/log/regular_access.log main if=$no_bot;
    access_log   /var/log/bots_access.log main if=$bot;

This works, but it is not really nice when you want to use nginx as a reverse proxy that forwards to several web servers. (It is not a very flexible way to define the names of the log files.)
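As a rough sketch of how the same idea could be adapted to the categories in the question (0 = human, 1 = spider, 2 = RSS reader, 3 = blocked), one helper map per log file can be derived from the question's $ifbot. The names $log_human, $log_spider, $log_rss and $log_bad are made up for this example; the log paths and the main format are taken from the question. It relies on the if= parameter of access_log, which needs nginx 1.7.0 or later and skips logging when the condition is "0" or an empty string:

```nginx
# http context: $ifbot is the map from the question.
map $ifbot $log_human  { default 0; 0 1; }
map $ifbot $log_spider { default 0; 1 1; }
map $ifbot $log_rss    { default 0; 2 1; }
map $ifbot $log_bad    { default 0; 3 1; }

server {
    # Each request satisfies exactly one condition, so it lands in one file only.
    access_log /web/log/location_access.log main if=$log_human;
    access_log /web/log/spider_access.log   main if=$log_spider;
    access_log /web/log/rssbot_access.log   main if=$log_rss;
    access_log /web/log/badbot.log          main if=$log_bad;

    # Reject the blocked class; the request is still logged to badbot.log.
    if ($ifbot = 3) {
        return 403;
    }
}
```

The point of the design is that no access_log line is unconditional, which is what caused robot requests to show up in both location_access.log and spider_access.log in the question.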

Better looking but not working

I would have liked to use this solution:

http part:

    map $http_user_agent $bot_header {
        default "";
        "~*Googlebot"   "bots_";
        "~*MJ12bot"     "bots_";
        # Add as many as desired
    }

    map $server_name $log_filename {
        default          "unknown";
        "site1....."     "site1_***.log";
        "site2....."     "site2_***.log";
    }

And then, in each server part:

    server { # simple reverse-proxy...
        listen       37........:80;
        server_name  dev.****.net;
        access_log   /var/log/nginx/access/$bot_header$log_filename  main;

        # pass all requests
        location / {
            # There, your config
        }
    }

But this second one doesn't work. Even though the path points to the right file and the permissions on it are correct, nginx logs an error saying that its permissions are insufficient. The funny part is that this error is written to a file with exactly the same owner and permissions as the one it cannot write to. I have no idea why, or whether it's a bug. Maybe someone can try it and fix the problem?
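A possible explanation, not verified against the setup above: the nginx documentation lists extra constraints for an access_log path that contains variables. Such a file is opened and closed by the worker process on each write, so the user the workers run as needs permission to create files in the log directory (permissions on the file alone are not enough), and on each write nginx checks that the request's root directory exists and does not create the log if it is missing. In a pure reverse-proxy server block there is often no root at all. A sketch under those assumptions, where the listen value, server name, root path and upstream address are placeholders invented for the example:

```nginx
server {
    listen      80;                 # placeholder for the elided address
    server_name dev.example.net;    # hypothetical name
    root        /var/www/empty;     # hypothetical; the directory must exist

    # The worker user must be able to create files in /var/log/nginx/access/.
    access_log  /var/log/nginx/access/$bot_header$log_filename main;

    # Optional: cache descriptors of frequently used variable-named logs.
    open_log_file_cache max=100 inactive=20s valid=1m min_uses=2;

    location / {
        proxy_pass http://127.0.0.1:8080;   # hypothetical upstream
    }
}
```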