How to avoid sending emails to Google's deep web crawler
My website has an area restricted to users who sign up with a valid email. I have got requests with bogus emails, and I want to avoid sending emails to non-existent addresses lest they increase the bounce rate and hurt my sending reputation.
The emails are:
[email protected]
[email protected]
kWQcHVzn%40ypEcDvh.NwB
The last one has %40, the URL-encoded (percent-encoded) form of @. The emails are truncations of the same character sequence.
Inspecting IP address of the requests with reverse DNS, all three requests come from cache.google.com. If the requests come from Google's crawler, I would expect these email addresses to be documented, but I could not find any reference.
In case it is the Google crawler, I want it to index the website while avoiding sending emails to bogus addresses. I have already implemented filtering on the address looking for that character sequence.
Is there a list of bogus addresses that deep web crawlers use to gain access and index hidden pages?
Update
Following the answer and the comment pointing out how to verify whether Googlebot is the crawler, I confirmed that it is not:
$ host 212.113.167.197
197.167.113.212.in-addr.arpa domain name pointer cache.google.com.
$ host cache.google.com
Host cache.google.com not found: 3(NXDOMAIN)
So indeed, it seems to be a malicious user, which explains why that email address is not documented as coming from Google.
Inspecting IP address of the requests with reverse DNS, all three requests come from cache.google.com.
When doing a reverse lookup, do not forget to check if a forward lookup of the host name points to the IP-address you are investigating.
> host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.
> host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1
When the reverse and forward DNS records align, as in this example, you can trust the result. Otherwise you may have a sloppy administrator, or an attacker attempting to hide their origin.
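The check above can also be automated. What follows is only a minimal Python sketch, not a definitive implementation: the function names (`reverse_lookup`, `forward_lookup`, `is_forward_confirmed`) and the `allowed_suffixes` parameter are made up for illustration, and the resolvers are passed in as arguments so the logic can be exercised without touching the network.

```python
import ipaddress
import socket
from typing import Callable, Iterable, List, Optional


def reverse_lookup(ip: str) -> Optional[str]:
    """Return the PTR host name for ip, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def forward_lookup(hostname: str) -> List[str]:
    """Return the addresses hostname resolves to (empty list on failure)."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except OSError:
        return []
    return sorted({info[4][0] for info in infos})


def _same_address(a: str, b: str) -> bool:
    """Compare two textual IP addresses, ignoring formatting differences."""
    try:
        return ipaddress.ip_address(a) == ipaddress.ip_address(b)
    except ValueError:
        return False


def is_forward_confirmed(
    ip: str,
    allowed_suffixes: Iterable[str] = (),
    reverse: Callable[[str], Optional[str]] = reverse_lookup,
    forward: Callable[[str], List[str]] = forward_lookup,
) -> bool:
    """True only if the PTR name of ip resolves back to ip.

    allowed_suffixes is a hypothetical extra filter: when given, the PTR
    name must equal one of the domains or be a subdomain of one of them.
    """
    hostname = reverse(ip)
    if not hostname:
        return False
    hostname = hostname.rstrip(".").lower()

    suffixes = [s.strip(".").lower() for s in allowed_suffixes]
    if suffixes and not any(
        hostname == s or hostname.endswith("." + s) for s in suffixes
    ):
        return False

    # The reverse record alone proves nothing; the forward record must agree.
    return any(_same_address(ip, addr) for addr in forward(hostname))
```

With the default resolvers, `is_forward_confirmed("66.249.66.1", allowed_suffixes=["googlebot.com"])` performs the same two lookups as the `host` commands above. The suffix list is an assumption on my part; check the crawler operator's own documentation for the domains it actually uses.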
Please use a Whois query on the IP-address rather than a reverse DNS lookup to determine the owner when investigating abuse.
Whatever the reverse DNS record of an IP-address resolves to, especially an attacker's, is not always reliable information.
Note that the owner of an IP-address range can set any value they want on reverse DNS records. There is no limitation that they can only use host names that they own, nor is there any inherent technical limitation that a reverse DNS record must match a forward DNS record.
(Although most diligent providers do try to enforce that when they allow their customers to set up custom reverse DNS records on the public IP-addresses they use.)
Setting up a fake reverse DNS record is a trick some attackers use to hide their tracks and/or to appear more benign when attempting to circumvent access controls.