Using Scrapy to to find and download pdf files from a website

The spider logic seems incorrect.

I had a quick look at your website, and seems there are several types of pages:

http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html the initial page
Webpages for specific articles, e.g. http://www.pwc.com/us/en/tax-services/publications/insights/australia-introduces-new-foreign-resident-cgt-withholding-regime.html which could be navigated from page #1
Actual PDF locations, e.g. http://www.pwc.com/us/en/state-local-tax/newsletters/salt-insights/assets/pwc-wotc-precertification-period-extended-to-june-29.pdf which could be navigated from page #2

Thus the correct logic looks like: get the #1 page first, get #2 pages then, and we could download those #3 pages.
However your spider tries to extract links to #3 pages directly from the #1 page.

EDITED:

I have updated your code, and here's something that actually works:

import urlparse
import scrapy

from scrapy.http import Request

class pwc_tax(scrapy.Spider):
    name = "pwc_tax"

    allowed_domains = ["www.pwc.com"]
    start_urls = ["http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html"]

    def parse(self, response):
        for href in response.css('div#all_results h3 a::attr(href)').extract():
            yield Request(
                url=response.urljoin(href),
                callback=self.parse_article
            )

    def parse_article(self, response):
        for href in response.css('div.download_wrapper a[href$=".pdf"]::attr(href)').extract():
            yield Request(
                url=response.urljoin(href),
                callback=self.save_pdf
            )

    def save_pdf(self, response):
        path = response.url.split('/')[-1]
        self.logger.info('Saving PDF %s', path)
        with open(path, 'wb') as f:
            f.write(response.body)

Using Scrapy to to find and download pdf files from a website

Related

Recent Posts