I'm going to block all bots except the big search engines. One of my blocking methods will be to check the Accept-Language header: if a request has no Accept-Language, the bot's IP address will be blocked until 2037. Googlebot does not send Accept-Language, so I want to verify it with a DNS lookup:

<?php
gethostbyaddr($_SERVER['REMOTE_ADDR']);
?>

Is it OK to use gethostbyaddr? Can someone get past my "gethostbyaddr protection"?


Solution 1:

function detectSearchBot($ip, $agent, &$hostname)
{
    $hostname = $ip;

    // Check the User-Agent first so gethostbyaddr() is not called needlessly
    if (preg_match('/(?:google|yandex)bot/iu', $agent)) {
        // Success: returns the host name. Failure: returns the IP unchanged, or false on malformed input
        $hostname = gethostbyaddr($ip);

        // https://support.google.com/webmasters/answer/80553
        if ($hostname !== false && $hostname != $ip) {
            // Accept only host names under the Google and Yandex crawler domains
            if (preg_match('/\.((?:google(?:bot)?|yandex)\.(?:com|ru))$/iu', $hostname)) {
                // Forward lookup: returns a list of IPv4 addresses, or false on failure
                $resolvedIps = gethostbynamel($hostname);

                // The forward lookup must lead back to the connecting address,
                // otherwise a faked reverse (PTR) record would be accepted
                if ($resolvedIps !== false && in_array($ip, $resolvedIps, true)) {
                    return true;
                }
            }
        }
    }

    return false;
}

In my project, I use this function to identify Google and Yandex search bots.

The result of the detectSearchBot function is cached.

The algorithm is based on Google’s recommendation - https://support.google.com/webmasters/answer/80553
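
The caching itself is not shown in this answer. A minimal sketch of one way to do it, assuming a writable cache directory and a one-day lifetime; the helper name cachedBotCheck, the file naming and the TTL are all hypothetical, not part of the original code:

```php
<?php
// Hypothetical helper: stores one verdict per IP in a small file so the
// DNS lookups run at most once per $ttl seconds for a given address.
// $verify is any callable that takes the IP and returns a bool,
// for example a thin wrapper around detectSearchBot().
function cachedBotCheck(string $ip, callable $verify, string $cacheDir, int $ttl = 86400): bool
{
    $file = $cacheDir . '/bot_' . md5($ip) . '.cache';

    // Drop PHP's stat cache for this file so the age check sees fresh data
    clearstatcache(true, $file);

    // Reuse a cached verdict that is still fresh and skip DNS entirely
    if (is_file($file) && (time() - filemtime($file)) < $ttl) {
        return file_get_contents($file) === '1';
    }

    // No usable cache entry: run the real check and remember the result
    $isBot = (bool) $verify($ip);
    file_put_contents($file, $isBot ? '1' : '0', LOCK_EX);

    return $isBot;
}
```

Negative results are cached as well, so a client that only fakes the Googlebot User-Agent does not trigger a fresh lookup on every request.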

Solution 2:

In addition to Cristian's answer:

function is_valid_google_ip($ip) {
    
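    $hostname = gethostbyaddr($ip); // e.g. "crawl-66-249-66-1.googlebot.com"

    // Group the alternatives so both domains are anchored at the end of the host name;
    // the ungrouped pattern also matched names such as "x.googlebot.example.net"
    return is_string($hostname) && preg_match('/\.(?:googlebot|google)\.com$/i', $hostname) === 1;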
}

function is_valid_google_request($ip=null,$agent=null){
    
    if(is_null($ip)){
        
        $ip=$_SERVER['REMOTE_ADDR'];
    }
    
    if(is_null($agent)){
        
        $agent=$_SERVER['HTTP_USER_AGENT'] ?? ''; // the header may be missing
    }
    
    $is_valid_request=false;

    if (strpos($agent, 'Google')!==false && is_valid_google_ip($ip)){
        
        $is_valid_request=true;
    }
    
    return $is_valid_request;
}

Note

Sometimes $_SERVER['HTTP_X_FORWARDED_FOR'] or $_SERVER['REMOTE_ADDR'] returns more than one IP address, for example '155.240.132.261, 196.250.25.120'. When this string is passed as an argument to gethostbyaddr(), PHP gives the following warning:

Warning: Address is not a valid IPv4 or IPv6 address in...

To work around this I use the following code to extract the first IP address from the string and discard the rest. (If you wish to use the other IPs they will be in the other elements of the $ips array).

if (strstr($remoteIP, ', ')) {
    $ips = explode(', ', $remoteIP);
    $remoteIP = $ips[0];
}

https://www.php.net/manual/en/function.gethostbyaddr.php
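
A slightly more defensive variant of the snippet above, as a sketch: it also copes with entries separated by a comma without a space and rejects anything that is not a well-formed address before it reaches gethostbyaddr(). The function name firstForwardedIp is made up for this example. Keep in mind that X-Forwarded-For is set by the client or by proxies, so it should only be trusted when it comes from your own proxy.

```php
<?php
// Hypothetical helper: returns the first entry of a comma-separated
// address list if it is a valid IPv4 or IPv6 address, otherwise null.
function firstForwardedIp(string $header): ?string
{
    // Split on commas and strip the surrounding whitespace of the first entry
    $parts = explode(',', $header);
    $candidate = trim($parts[0]);

    // filter_var() returns the address on success and false on failure
    $valid = filter_var($candidate, FILTER_VALIDATE_IP);

    return $valid === false ? null : $valid;
}
```

With this, '192.0.2.10, 198.51.100.7' yields '192.0.2.10', while the example string from the note yields null because 155.240.132.261 is not a valid address (261 is out of range).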

Solution 3:

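// The function
function is_google() {
    // strpos() returns 0 for a match at the start of the string, so compare with false explicitly.
    // The User-Agent header is trivially spoofed, so treat this as a pre-filter, not as verification.
    return strpos($_SERVER['HTTP_USER_AGENT'] ?? '', 'Googlebot') !== false;
}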

Solution 4:

The way Google recommends is to do a reverse DNS lookup (gethostbyaddr) to get the associated host name, and then resolve that name back to an IP (gethostbyname) and compare it to REMOTE_ADDR, because reverse lookups can be faked too.

But beware: these lookups take time and can severely slow down your page, so it may be worth checking the user agent first and doing the lookups only for requests that claim to be a bot.

See https://webmasters.googleblog.com/2006/09/how-to-verify-googlebot.html
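
The two-step check described above could look roughly like this. It is a sketch, not Google's code: the function name isVerifiedGooglebot and the two optional resolver parameters are assumptions added so the logic can be exercised without real DNS, and the forward lookup via gethostbynamel() only covers IPv4.

```php
<?php
// Sketch of a forward-confirmed reverse DNS check.
// $reverse and $forward are injectable for testing (an assumption of this
// example); by default PHP's own resolver functions are used.
function isVerifiedGooglebot(string $ip, ?callable $reverse = null, ?callable $forward = null): bool
{
    $reverse = $reverse ?? 'gethostbyaddr';
    $forward = $forward ?? 'gethostbynamel';

    // Step 1: reverse lookup. gethostbyaddr() returns the unmodified IP
    // when nothing is found and false on malformed input.
    $host = $reverse($ip);
    if (!is_string($host) || $host === $ip) {
        return false;
    }

    // Step 2: the host name must end in googlebot.com or google.com
    if (preg_match('/\.(?:googlebot|google)\.com$/i', $host) !== 1) {
        return false;
    }

    // Step 3: forward lookup. gethostbynamel() returns a list of IPv4
    // addresses or false; the connecting address must be among them.
    $addresses = $forward($host);

    return is_array($addresses) && in_array($ip, $addresses, true);
}
```

Step 3 is what defeats a faked PTR record: whoever controls the reverse zone of an IP can make it say anything, but cannot make Google's own DNS point that name back at their address.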

Solution 5:

How to verify Googlebot.