Python cloudscraper requests slow, with 403 responses

I am using the Cloudscraper Python library to obtain a JSON response from a URL. The problem is that I have to retry the same request 2-3 times before I get the correct output; the first responses come back with a 403 HTTP status code.

Here is my code:

import json
from time import sleep
import cloudscraper

url = "https://www.endpoint.com/api/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0",
    "Accept": "*/*",
    "Content-Type": "application/json"
}

def get_json_response():
    json_response = None
    while json_response is None:
        # A fresh scraper session is created for every attempt
        scraper = cloudscraper.create_scraper()
        r = scraper.get(url, headers=headers)
        try:
            json_response = json.loads(r.text)
        except json.JSONDecodeError:
            # The 403 challenge page is not valid JSON, so wait and retry
            print(r.status_code)
            sleep(2)
    return json_response

What can I do to optimize my code and prevent the 403 responses?


Solution 1:

Try https://rapidapi.com/restyler/api/scrapeninja: it emulates a Chrome browser fingerprint and seems to work fine with the website you've mentioned.
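
For reference, a call through RapidAPI with plain requests might look roughly like the sketch below. The /scrape endpoint path and the JSON body shape are assumptions on my part (only the X-RapidAPI-* headers follow the standard RapidAPI scheme), so check ScrapeNinja's documentation for the exact contract:

import requests

# Sketch only: the endpoint path and body shape are assumed, not taken from the docs
RAPIDAPI_KEY = "YOUR_RAPIDAPI_KEY"

response = requests.post(
    "https://scrapeninja.p.rapidapi.com/scrape",
    json={"url": "https://www.endpoint.com/api/"},
    headers={
        "X-RapidAPI-Key": RAPIDAPI_KEY,
        "X-RapidAPI-Host": "scrapeninja.p.rapidapi.com",
    },
)
print(response.json())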

Solution 2:

You could use a real browser to get around part of the bot detection; here is an example with Playwright:

import json

from playwright.sync_api import sync_playwright

API_URL = 'https://www.soraredata.com/api/players/info/29301348132354218386476497174231278066977835432352170109275714645119105189666'

with sync_playwright() as p:
    # WebKit is the fastest to start and the hardest to detect
    browser = p.webkit.launch(headless=True)

    page = browser.new_page()
    page.goto(API_URL)

    # Use evaluate() instead of content() so we don't need bs4 or lxml
    # to strip the HTML wrapper the browser puts around the JSON body
    html = page.evaluate('document.querySelector("pre").innerText')

    browser.close()

try:
    data = json.loads(html)
except json.JSONDecodeError:
    # Might still get a challenge page instead of JSON sometimes
    data = None

print(data)
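
Since the comment above notes that the parse can still fail sometimes, you could wrap the whole fetch in the same kind of retry loop you already use with cloudscraper. A minimal sketch, where the fetch_json helper name and the three-attempt limit are my own choices rather than anything Playwright provides:

import json
from time import sleep

from playwright.sync_api import sync_playwright

def fetch_json(url, attempts=3):
    # Hypothetical helper: retry the headless fetch a few times before giving up
    data = None
    with sync_playwright() as p:
        browser = p.webkit.launch(headless=True)
        page = browser.new_page()
        for _ in range(attempts):
            page.goto(url)
            # "?." yields undefined (None in Python) when the <pre> element is missing
            text = page.evaluate('document.querySelector("pre")?.innerText')
            try:
                data = json.loads(text)
                break
            except (TypeError, json.JSONDecodeError):
                # Got a challenge page instead of JSON; wait and try again
                sleep(2)
        browser.close()
    return data

print(fetch_json(API_URL))  # API_URL as defined above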