How to identify who is scraping my website?

I have an e-commerce website, hosted on AWS.

I understand there are tools that prevent or block scraping bots. But is it possible to detect who is scraping my website? I mean, would I be able to detect that the requests are coming from a bot, then find the bot's IP address and use it to identify the server that is scraping my website?


Solution 1:

A well-behaved bot or web scraper will identify itself with a User-Agent header (and honor your robots.txt, if you want to direct its behavior), which makes it easy to identify.

A malicious bot (one that does not request or honor your robots.txt) may still identify itself with a User-Agent header. That lets you identify it and then create and enforce server-side policies to try to control its behavior. When a bot uses a User-Agent string identical to that of a real web browser, however, you can’t use the header to identify it, and it can be quite hard to distinguish requests from the bot from those made by real users.
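As a rough illustration of the User-Agent check, here is a minimal Python sketch. The token list and the function name `looks_like_declared_bot` are my own assumptions for the example, not a standard. Treating an empty User-Agent as suspicious is also an assumption. It only catches bots that announce themselves; a bot sending a browser's User-Agent string will pass straight through.

```python
# Illustrative, non-exhaustive tokens (an assumption); tune against your own logs.
BOT_TOKENS = ("bot", "crawler", "spider", "scrapy", "python-requests", "curl", "wget")


def looks_like_declared_bot(user_agent: str) -> bool:
    """Return True when the User-Agent is missing or contains a known bot token."""
    ua = (user_agent or "").strip().lower()
    # Assumption: a missing or "-" User-Agent is treated as suspicious.
    if not ua or ua == "-":
        return True
    # Case-insensitive substring match against the token list.
    return any(token in ua for token in BOT_TOKENS)


if __name__ == "__main__":
    examples = [
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
        "python-requests/2.31.0",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    ]
    for ua in examples:
        print(looks_like_declared_bot(ua), ua)
```

Substring matching is deliberately crude; it is meant as a first filter before you look at request rates or other behavior.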

Once you have determined which requests come from a bot, your logs will also contain the IP address that was the source of each request.
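To show how the source IP can be pulled out of the logs, here is a small Python sketch that counts requests per (IP, User-Agent) pair. It assumes your access log is in the common "combined" format used by Apache and nginx; other log sources (for example a load balancer's own logs) use a different layout and would need a different pattern. The sample lines, the IP addresses (documentation ranges), and the name `requests_per_client` are made up for the example.

```python
import re
from collections import Counter

# Assumption: "combined" access log format, i.e.
# ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def requests_per_client(lines):
    """Count requests per (source IP, User-Agent) pair; unparseable lines are skipped."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            counts[(match.group("ip"), match.group("ua"))] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample data, not real traffic.
    sample_lines = [
        '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /products/1 HTTP/1.1" 200 512 "-" "python-requests/2.31.0"',
        '203.0.113.7 - - [10/Oct/2023:13:55:37 +0000] "GET /products/2 HTTP/1.1" 200 498 "-" "python-requests/2.31.0"',
        '198.51.100.4 - - [10/Oct/2023:13:56:01 +0000] "GET / HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
    ]
    # Busiest clients first.
    for (ip, ua), hits in requests_per_client(sample_lines).most_common():
        print(hits, ip, ua)
```

Sorting by request count puts the busiest clients at the top. Keep in mind that the logged address may belong to a proxy or VPN rather than to the scraper's own server.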

When you can’t readily identify requests as coming from a bot, keep in mind that you typically make your web content public because you want it to be found and accessed. If your server can’t handle the requests coming from a bot, you have a bigger problem: it won’t be able to handle a reasonable number of concurrent real visitors either.