Scrapy - how to manage cookies/sessions

Solution 1:

Three years later, I think this is exactly what you were looking for: http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#std:reqmeta-cookiejar

Just use something like this in your spider's start_requests method:

for i, url in enumerate(urls):
    yield scrapy.Request(url, meta={'cookiejar': i},
        callback=self.parse_page)

And remember that for subsequent requests, you need to explicitly reattach the cookiejar each time:

def parse_page(self, response):
    # do some processing
    return scrapy.Request("http://www.example.com/otherpage",
        meta={'cookiejar': response.meta['cookiejar']},
        callback=self.parse_other_page)
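
For completeness, here is a minimal sketch of how the two snippets above fit together in a single spider. The spider name, URL list, and callback bodies are placeholder assumptions for illustration, not part of the original answer:

import scrapy

class SessionSpider(scrapy.Spider):
    name = 'session_spider'  # hypothetical name
    search_urls = ['http://www.example.com/search?q=a',  # placeholder URLs
                   'http://www.example.com/search?q=b']

    def start_requests(self):
        # one cookiejar per search, so each search keeps its own session cookies
        for i, url in enumerate(self.search_urls):
            yield scrapy.Request(url, meta={'cookiejar': i},
                                 callback=self.parse_page)

    def parse_page(self, response):
        # ... extract data here ...
        # pass the same cookiejar key along so the session is preserved
        yield scrapy.Request('http://www.example.com/otherpage',
                             meta={'cookiejar': response.meta['cookiejar']},
                             callback=self.parse_other_page)

    def parse_other_page(self, response):
        pass  # placeholder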

Solution 2:

import urlparse

from scrapy import log
from scrapy.http import Request
from scrapy.http.cookies import CookieJar
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
...

class Spider(BaseSpider):
    def parse(self, response):
        '''Parse category page, extract subcategories links.'''

        hxs = HtmlXPathSelector(response)
        subcategories = hxs.select(".../@href").extract()
        for subcategorySearchLink in subcategories:
            subcategorySearchLink = urlparse.urljoin(response.url, subcategorySearchLink)
            self.log('Found subcategory link: ' + subcategorySearchLink, log.DEBUG)
            # Use dont_merge_cookies to force the site to generate a new PHPSESSID cookie.
            # This is needed because the site uses sessions to remember the search parameters.
            yield Request(subcategorySearchLink, callback=self.extractItemLinks,
                          meta={'dont_merge_cookies': True})

    def extractItemLinks(self, response):
        '''Extract item links from subcategory page and go to next page.'''
        hxs = HtmlXPathSelector(response)
        for itemLink in hxs.select(".../a/@href").extract():
            itemLink = urlparse.urljoin(response.url, itemLink)
            print 'Requesting item page %s' % itemLink
            yield Request(...)

        nextPageLink = self.getFirst(".../@href", hxs)
        if nextPageLink:
            nextPageLink = urlparse.urljoin(response.url, nextPageLink)
            self.log('\nGoing to next search page: ' + nextPageLink + '\n', log.DEBUG)
            cookieJar = response.meta.setdefault('cookie_jar', CookieJar())
            cookieJar.extract_cookies(response, response.request)  # remember the cookies the server just set
            request = Request(nextPageLink, callback=self.extractItemLinks,
                              meta={'dont_merge_cookies': True, 'cookie_jar': cookieJar})
            cookieJar.add_cookie_header(request)  # attach the stored cookies to the new request ourselves
            yield request
        else:
            self.log('Whole subcategory scraped.', log.DEBUG)
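
If you end up repeating this in several callbacks, the same extract_cookies / add_cookie_header pattern could be factored into a custom downloader middleware. A rough sketch only: the class name is made up, you would still need to enable it in DOWNLOADER_MIDDLEWARES, and you would keep dont_merge_cookies on your requests so the built-in CookiesMiddleware stays out of the way:

from scrapy.http.cookies import CookieJar

class MetaCookieJarMiddleware(object):
    '''Hypothetical middleware: applies whatever CookieJar is passed
    in request.meta['cookie_jar'] instead of doing it in each callback.'''

    def process_request(self, request, spider):
        jar = request.meta.get('cookie_jar')
        if jar is not None:
            jar.add_cookie_header(request)  # attach stored cookies to the outgoing request

    def process_response(self, request, response, spider):
        jar = request.meta.get('cookie_jar')
        if jar is not None:
            jar.extract_cookies(response, request)  # remember cookies the server just set
        return response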

Solution 3:

I think the simplest approach would be to run multiple instances of the same spider, passing the search query as a spider argument (received in the constructor), so that each instance reuses Scrapy's built-in cookie management. You will end up with multiple spider instances, each crawling one specific search query and its results, but you need to run the spiders yourself with:

scrapy crawl myspider -a search_query=something

Or you can use Scrapyd for running all the spiders through the JSON API.
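
In case it helps, this is roughly what such a spider could look like. The spider name, attribute, and URL template are assumptions for illustration only:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, search_query=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # the value passed with -a search_query=... ends up here
        self.search_query = search_query

    def start_requests(self):
        # each instance crawls a single query, so Scrapy's default cookie
        # handling keeps that query's session separate from other instances
        yield scrapy.Request('http://www.example.com/search?q=%s' % self.search_query,
                             callback=self.parse)

    def parse(self, response):
        pass  # extract the results for this query here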