Python cloudscraper requests slow, with 403 responses
I am using the cloudscraper Python library to obtain a JSON response from a URL. The problem is that I have to retry the same request 2-3 times before I get the correct output; the first responses come back with a 403 HTTP status code.
Here is my code:
import json
from time import sleep

import cloudscraper

url = "https://www.endpoint.com/api/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0",
    "Accept": "*/*",
    "Content-Type": "application/json"
}

def get_json():
    json_response = 0
    while json_response == 0:
        try:
            scraper = cloudscraper.create_scraper()
            r = scraper.get(url, headers=headers)
            json_response = json.loads(r.text)
        except:
            # Body was not valid JSON (usually the 403 challenge page), wait and retry
            print(r.status_code)
            sleep(2)
    return json_response
What can I do in order to optimize my code and prevent the 403 responses?
Solution 1:
Try https://rapidapi.com/restyler/api/scrapeninja - it emulates a Chrome fingerprint and seems to work fine with the website you've mentioned.
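A minimal sketch of calling it from Python with requests; the endpoint path, payload fields, and header names follow the usual RapidAPI convention and are assumptions on my part, so check the API's documentation before relying on them:

import requests

# Assumed endpoint and payload shape -- verify against the ScrapeNinja page on RapidAPI.
SCRAPENINJA_URL = "https://scrapeninja.p.rapidapi.com/scrape"
headers = {
    "X-RapidAPI-Key": "YOUR_RAPIDAPI_KEY",            # your personal RapidAPI key
    "X-RapidAPI-Host": "scrapeninja.p.rapidapi.com",  # standard RapidAPI host header
    "Content-Type": "application/json",
}
payload = {"url": "https://www.endpoint.com/api/"}    # the page you want fetched on your behalf

r = requests.post(SCRAPENINJA_URL, json=payload, headers=headers)
r.raise_for_status()
print(r.json())  # the target body is usually wrapped inside the JSON the service returns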
Solution 2:
You could use a real browser to get past some of the bot detection; here is an example with Playwright:
import json
from playwright.sync_api import sync_playwright

API_URL = 'https://www.soraredata.com/api/players/info/29301348132354218386476497174231278066977835432352170109275714645119105189666'

with sync_playwright() as p:
    # WebKit is fastest to start and hardest to detect
    browser = p.webkit.launch(headless=True)
    page = browser.new_page()
    page.goto(API_URL)
    # Use evaluate instead of `content` so bs4 or lxml is not needed to strip the HTML wrapper
    html = page.evaluate('document.querySelector("pre").innerText')
    try:
        data = json.loads(html)
    except json.JSONDecodeError:
        # Still might fail sometimes, e.g. if a challenge page was served instead of the JSON
        data = None
    print(data)
    browser.close()
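If you need to hit the endpoint repeatedly, launching a browser per request is expensive. Here is a small sketch that keeps one WebKit instance alive and retries when the body does not parse as JSON; the helper name, retry count, and delay are arbitrary choices, not part of Playwright's API:

import json
from time import sleep
from playwright.sync_api import sync_playwright

API_URL = 'https://www.soraredata.com/api/players/info/29301348132354218386476497174231278066977835432352170109275714645119105189666'

def fetch_json(page, url, retries=3, delay=2):
    # Retry a few times in case a challenge page is served before the real JSON.
    for _ in range(retries):
        page.goto(url)
        # `?.` yields undefined (None in Python) when no <pre> element is present.
        body = page.evaluate('document.querySelector("pre")?.innerText')
        try:
            return json.loads(body)
        except (TypeError, json.JSONDecodeError):
            sleep(delay)
    return None

with sync_playwright() as p:
    browser = p.webkit.launch(headless=True)
    page = browser.new_page()
    print(fetch_json(page, API_URL))
    browser.close()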