Why does my program to scrape the NSE website get blocked on servers but work locally?

Solution 1:

I ran into the same problem. I do not know of a proper solution using the python-requests module; most likely NSE simply blocks its requests.

Here is a workaround in Python that works. It is crude, and I use it without having dug into why it works:

import os
import subprocess

# Work in the script's own directory so maxpain.txt is written next to it.
os.chdir(os.path.dirname(os.path.abspath(__file__)))

# subprocess.run() waits for curl to exit. Popen() returns immediately, so the
# file could still be missing or incomplete when it is read below.
subprocess.run('curl "https://www.nseindia.com/api/quote-derivative?symbol=BANKNIFTY" -H "authority: beta.nseindia.com" -H "cache-control: max-age=0" -H "dnt: 1" -H "upgrade-insecure-requests: 1" -H "user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36" -H "sec-fetch-user: ?1" -H "accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9" -H "sec-fetch-site: none" -H "sec-fetch-mode: navigate" -H "accept-encoding: gzip, deflate, br" -H "accept-language: en-US,en;q=0.9,hi;q=0.8" --compressed  -o maxpain.txt', shell=True, check=True)

# Read back the response that curl saved.
with open("maxpain.txt", "r") as f:
    var = f.read()
print(var)

It simply runs the curl command, writes the output to a file, and reads the file back. That's it.
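The same idea can be sketched without the temporary file by capturing curl's standard output directly. This is only an illustration, not a tested fix for the blocking: the helper names `build_curl_command` and `fetch_json` are hypothetical, only a subset of the headers from the command above is passed, and it assumes curl is on the PATH and that the endpoint returns JSON.

```python
import json
import subprocess

# URL and header values are copied from the curl command in this answer.
API_URL = "https://www.nseindia.com/api/quote-derivative?symbol=BANKNIFTY"
HEADERS = {
    "user-agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36"
    ),
    "accept-language": "en-US,en;q=0.9,hi;q=0.8",
}


def build_curl_command(url, headers):
    """Return the curl invocation as an argument list (hypothetical helper).

    Passing a list instead of one shell string avoids shell quoting issues.
    """
    cmd = ["curl", "--silent", "--compressed", url]
    for name, value in headers.items():
        cmd += ["-H", f"{name}: {value}"]
    return cmd


def fetch_json(url, headers, timeout=30):
    """Run curl, wait for it to exit, and parse its stdout as JSON.

    Assumes the response body is JSON; raises CalledProcessError if curl
    exits with a non-zero status.
    """
    result = subprocess.run(
        build_curl_command(url, headers),
        capture_output=True,
        text=True,
        check=True,
        timeout=timeout,
    )
    return json.loads(result.stdout)
```

Calling `fetch_json(API_URL, HEADERS)` would then return the parsed response, provided NSE does not block the request.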

Solution 2:

There are two things to note.

The request headers need to include 'Host' and 'User-Agent':

__request_headers = {
    'Host': 'www.nseindia.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:82.0) Gecko/20100101 Firefox/82.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
}

The following cookies are set dynamically by the site, so they must be read from a response and sent back with later requests:

'nsit',
'nseappid',
'ak_bmsc'

NSE sets these depending on the functionality being used. In this example I fetch the top gainers / losers list; without these cookies the request is blocked.

import logging
import requests

logger = logging.getLogger(__name__)

try:
    nse_url = 'https://www.nseindia.com/market-data/top-gainers-loosers'
    url = 'https://www.nseindia.com/api/live-analysis-variations?index=gainers'
    # Load the regular page first so that NSE sets the cookies.
    resp = requests.get(url=nse_url, headers=__request_headers)
    if resp.ok:
        req_cookies = dict(nsit=resp.cookies['nsit'], nseappid=resp.cookies['nseappid'], ak_bmsc=resp.cookies['ak_bmsc'])
        # Call the API, sending back the cookies obtained above.
        tresp = requests.get(url=url, headers=__request_headers, cookies=req_cookies)
        result = tresp.json()
        res_data = result["NIFTY"]["data"] if "NIFTY" in result and "data" in result["NIFTY"] else []
        if res_data:
            __top_list = res_data
except (OSError, KeyError, ValueError) as err:
    # KeyError: an expected cookie was not set; ValueError: the body was not JSON.
    logger.error('Unable to fetch data: %s', err)
Also note that NSE blocks these requests from most cloud VMs, such as those on AWS and GCP. I was able to fetch the data from a personal Windows machine, but not from AWS or GCP.
