How to complete a GeeTest captcha when scraping with python-requests, using request values taken from solving the captcha manually?

I'm trying to scrape a website that uses DataDome, and after some requests I have to complete a GeeTest (slider captcha puzzle).

Here is a sample link to it: captcha link

I've decided not to use Selenium (at least for now), and I'm trying to solve my problem with the Python module Requests. My idea was to complete the GeeTest myself and then send, from my program, the same request that my web browser sends after I complete the slider.

  1. First, I scraped the HTML that the website returned once the captcha prompt appeared:

    <head><title>allegro.pl</title><style>#cmsg{animation: A 1.5s;}@keyframes A{0%{opacity:0;}99%{opacity:0;}100%{opacity:1;}}</style></head><body style="margin:0"><p id="cmsg">Please enable JS and disable any ad blocker</p><script>var dd={'cid':'AHrlqAAAAAMAsB0jkXrMsMAsis8SQ==','hsh':'77DC0FFBAA0B77570F6B414F8E5BDB','t':'fe','s':29701,'host':'geo.captcha-delivery.com'}</script><script src="https://ct.captcha-delivery.com/c.js"></script></body></html>
    
  2. I couldn't access the iframe where the most important info is, but I found out that the link to that iframe can be built from the info in the HTML above. As you can see in the link above, cid is initialCid, hsh is hash, etc.; one part of the link, cid, is a cookie that I got at the moment the captcha appeared.

  3. I've seen that there are services that can solve captchas for you, so I decided to complete the captcha myself, copy the exact request my browser sends (including cookies and headers), and replay it from my program with requests. For now I'm doing this by hand, but it doesn't work: the response is 403, whereas in the browser it's 200 followed by a redirect.
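The link construction described in step 2 can be sketched roughly as follows. This is only an illustration under stated assumptions: the `initialCid` and `hash` parameter names come from the mapping described above, while the `/captcha/` path, the `cid`, `t` and `s` parameter names, and the helper names `parse_dd` and `build_captcha_url` are guesses or made up for the example, not a documented DataDome API.

```python
import json
import re
from urllib.parse import urlencode


def parse_dd(html):
    """Extract the `var dd={...}` object from the block page as a dict."""
    match = re.search(r"var dd=(\{.*?\})", html)
    if match is None:
        raise ValueError("no dd object found in page")
    # The literal uses single quotes; swap them so json can parse it.
    return json.loads(match.group(1).replace("'", '"'))


def build_captcha_url(dd, datadome_cookie):
    """Build the iframe link from the dd values plus the cookie value.

    ASSUMPTION: the path and the cid/t/s parameter names are guesses;
    compare against the real iframe URL in the browser's dev tools.
    """
    params = {
        "initialCid": dd["cid"],   # cid from the page becomes initialCid
        "hash": dd["hsh"],         # hsh from the page becomes hash
        "cid": datadome_cookie,    # cookie received when the captcha appeared
        "t": dd["t"],
        "s": dd["s"],
    }
    return "https://" + dd["host"] + "/captcha/?" + urlencode(params)
```

Here `datadome_cookie` would be the value of the cookie set on the response that triggered the captcha. `urlencode` percent-encodes characters such as `=` in the values, so the result may look slightly different from the link shown in the browser.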

Here is a sample request that my browser sends after I complete the captcha:

sample request

I'm sending it from my program like this:

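    import requests

    s = requests.Session()
    s.headers.update(headers)  # headers copied from the browser request
    # cookies.set() needs a name and a value; update() accepts a dict of name -> value
    s.cookies.update(cookie_from_web_browser)
    captcha = s.get(request)  # request is the URL the browser called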

The response is 403 and I have no idea how to make it work. Please help.


Solution 1:

Captchas are really tricky in the web scraping world. Most of the time you can get past one by solving the captcha manually, then taking the cookie returned with the resulting page and plugging it into your script. Depending on the website, the cookie could stay valid for 15 minutes, a day, or even longer.
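As a rough sketch of plugging a browser cookie into a script: the helpers below turn a `Cookie` header string copied from the browser's dev tools into a dict, and check whether a response looks like the captcha page again (a hint that the cookie has expired). The function names are made up for this example, and treating "403 plus a captcha-delivery.com reference in the body" as the block signal is an assumption based on the page shown in the question.

```python
def cookie_header_to_dict(header):
    """Split a copied 'name=value; name2=value2' Cookie header into a dict."""
    cookies = {}
    for part in header.split(";"):
        name, sep, value = part.strip().partition("=")
        if sep and name:
            # partition splits on the first '=' only, so values may contain '='
            cookies[name] = value
    return cookies


def looks_blocked(status_code, body):
    """ASSUMPTION: a block is a 403 whose body references captcha-delivery.com."""
    return status_code == 403 and "captcha-delivery.com" in body


# Usage with requests (not executed here; copied_header and url are placeholders):
#   import requests
#   s = requests.Session()
#   s.cookies.update(cookie_header_to_dict(copied_header))
#   r = s.get(url)
#   if looks_blocked(r.status_code, r.text):
#       ...  # solve the captcha in the browser again and copy a fresh cookie
```

The site may also compare headers such as the User-Agent with the ones the browser used when the cookie was issued, so it is probably worth copying those into the session as well.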

The other alternative is to use a captcha-solving service such as https://www.scraperapi.com/, where you pay a fee for a given number of requests but don't run into the captcha issue, because the service solves them for you.