How to handle a 302 redirect in Scrapy
I am receiving a 302 response from a server while scraping a website:
2014-04-01 21:31:51+0200 [ahrefs-h] DEBUG: Redirecting (302) to <GET http://www.domain.com/Site_Abuse/DeadEnd.htm> from <GET http://domain.com/wps/showmodel.asp?Type=15&make=damc&a=664&b=51&c=0>
I want to send requests to the GET URLs instead of being redirected. I found this middleware:
https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/downloadermiddleware/redirect.py#L31
I added this redirect code to my middleware.py file, and added this to settings.py:
DOWNLOADER_MIDDLEWARES = {
    'street.middlewares.RandomUserAgentMiddleware': 400,
    'street.middlewares.RedirectMiddleware': 100,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
}
But I am still getting redirected. Is that all I have to do to get this middleware working? Am I missing something?
Forget about middlewares in this scenario; this will do the trick:
meta = {'dont_redirect': True, 'handle_httpstatus_list': [302]}
That said, you will need to include the meta parameter when you yield your request:
yield Request(item['link'], meta={
    'dont_redirect': True,
    'handle_httpstatus_list': [302]
}, callback=self.your_callback)
An inexplicable 302 response, such as a redirect from a page that loads fine in a web browser to the home page or some fixed page, usually indicates a server-side measure against undesired activity.
You must either reduce your crawl rate or use a smart proxy (e.g. Crawlera) or a proxy-rotation service, and retry your requests when you get such a response.
To retry such a response, add 'handle_httpstatus_list': [302] to the meta of the source request, and check if response.status == 302 in the callback. If it is, retry your request by yielding response.request.replace(dont_filter=True).
When retrying, you should also limit the maximum number of retries for any given URL. You could keep a dictionary to track retry counts:
from scrapy import Request, Spider

class MySpider(Spider):
    name = 'my_spider'
    max_retries = 2

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.retries = {}

    def start_requests(self):
        yield Request(
            'https://example.com',
            callback=self.parse,
            meta={
                'handle_httpstatus_list': [302],
            },
        )

    def parse(self, response):
        if response.status == 302:
            retries = self.retries.setdefault(response.url, 0)
            if retries < self.max_retries:
                self.retries[response.url] += 1
                yield response.request.replace(dont_filter=True)
            else:
                self.logger.error('%s still returns 302 responses after %s retries',
                                  response.url, retries)
                return
Depending on the scenario, you might want to move this code to a downloader middleware.
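If you do move it to a downloader middleware, here is a minimal sketch of that idea (the class name Retry302Middleware is my own; this is not an existing Scrapy component). Registered in DOWNLOADER_MIDDLEWARES with a priority number greater than RedirectMiddleware's default of 600, its process_response sees the raw 302 before the redirect is followed:

```python
class Retry302Middleware:
    """Hypothetical downloader middleware: retries 302 responses a limited
    number of times instead of letting the redirect be followed."""

    max_retries = 2

    def __init__(self):
        # retry counts per URL, mirroring the spider-level example above
        self.retries = {}

    def process_response(self, request, response, spider):
        if response.status != 302:
            return response  # pass everything else through untouched
        retries = self.retries.setdefault(request.url, 0)
        if retries < self.max_retries:
            self.retries[request.url] += 1
            # re-schedule the same request, bypassing the duplicates filter
            return request.replace(dont_filter=True)
        # give up: let the 302 continue down the middleware chain
        return response
```

You would then enable it in settings.py with something like 'street.middlewares.Retry302Middleware': 650 (the module path depends on your project).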
You can also disable the RedirectMiddleware entirely by setting REDIRECT_ENABLED to False in settings.py.
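A minimal settings.py fragment for that approach; note that with redirects disabled, HttpErrorMiddleware will still filter out the 302 responses unless you also allow them, either globally via HTTPERROR_ALLOWED_CODES or per request via handle_httpstatus_list:

```python
# settings.py
REDIRECT_ENABLED = False         # disable RedirectMiddleware for all requests
HTTPERROR_ALLOWED_CODES = [302]  # let 302 responses reach spider callbacks
```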