How to get the scrapy failure URLs?

I'm new to Scrapy, and it's the most amazing crawler framework I have come across!

In my project, I sent more than 90,000 requests, but some of them failed. I set the log level to INFO, and I can only see some statistics, but no details.

2012-12-05 21:03:04+0800 [pd_spider] INFO: Dumping spider stats:
{'downloader/exception_count': 1,
 'downloader/exception_type_count/twisted.internet.error.ConnectionDone': 1,
 'downloader/request_bytes': 46282582,
 'downloader/request_count': 92383,
 'downloader/request_method_count/GET': 92383,
 'downloader/response_bytes': 123766459,
 'downloader/response_count': 92382,
 'downloader/response_status_count/200': 92382,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2012, 12, 5, 13, 3, 4, 836000),
 'item_scraped_count': 46191,
 'request_depth_max': 1,
 'scheduler/memory_enqueued': 92383,
 'start_time': datetime.datetime(2012, 12, 5, 12, 23, 25, 427000)}

Is there any way to get a more detailed report, for example one that shows the failed URLs? Thanks!


Yes, this is possible.

  • The code below adds a failed_urls list to a basic spider class and appends URLs to it whenever a response comes back with status 404 (extend this to cover other error statuses as required).
  • Next, I added a handler that joins the list into a single string and adds it to the spider's stats when the spider is closed.
  • Based on your comments, it is possible to track Twisted errors as well, and some of the answers below give examples of how to handle that particular use case.
  • The code has been updated to work with Scrapy 1.8. All thanks for this should go to Juliano Mendieta, since all I did was add his suggested edits and confirm that the spider works as intended.

from scrapy import Spider, signals

class MySpider(Spider):
    handle_httpstatus_list = [404] 
    name = "myspider"
    allowed_domains = ["example.com"]
    start_urls = [
        'http://www.example.com/thisurlexists.html',
        'http://www.example.com/thisurldoesnotexist.html',
        'http://www.example.com/neitherdoesthisone.html'
    ]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.failed_urls = []

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.handle_spider_closed, signals.spider_closed)
        return spider

    def parse(self, response):
        if response.status == 404:
            self.crawler.stats.inc_value('failed_url_count')
            self.failed_urls.append(response.url)

    def handle_spider_closed(self, reason):
        self.crawler.stats.set_value('failed_urls', ', '.join(self.failed_urls))

    # Note: process_exception is a downloader middleware hook, not a spider
    # method, so Scrapy will not actually call it as written here; it is kept
    # as a reference for how the stock DownloaderStats middleware counts
    # exceptions (see the middleware-based answer further down).
    def process_exception(self, request, exception, spider):
        ex_class = "%s.%s" % (exception.__class__.__module__, exception.__class__.__name__)
        self.crawler.stats.inc_value('downloader/exception_count', spider=spider)
        self.crawler.stats.inc_value('downloader/exception_type_count/%s' % ex_class, spider=spider)

Example output (note that the downloader/exception_count* stats will only appear if exceptions are actually thrown - I simulated them by trying to run the spider after I'd turned off my wireless adapter):

2012-12-10 11:15:26+0000 [myspider] INFO: Dumping Scrapy stats:
    {'downloader/exception_count': 15,
     'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 15,
     'downloader/request_bytes': 717,
     'downloader/request_count': 3,
     'downloader/request_method_count/GET': 3,
     'downloader/response_bytes': 15209,
     'downloader/response_count': 3,
     'downloader/response_status_count/200': 1,
     'downloader/response_status_count/404': 2,
     'failed_url_count': 2,
     'failed_urls': 'http://www.example.com/thisurldoesnotexist.html, http://www.example.com/neitherdoesthisone.html',
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2012, 12, 10, 11, 15, 26, 874000),
     'log_count/DEBUG': 9,
     'log_count/ERROR': 2,
     'log_count/INFO': 4,
     'response_received_count': 3,
     'scheduler/dequeued': 3,
     'scheduler/dequeued/memory': 3,
     'scheduler/enqueued': 3,
     'scheduler/enqueued/memory': 3,
     'spider_exceptions/NameError': 2,
     'start_time': datetime.datetime(2012, 12, 10, 11, 15, 26, 560000)}

Here's another example of how to handle and collect 404 errors (crawling the GitHub help pages):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.item import Item, Field


class GitHubLinkItem(Item):
    url = Field()
    referer = Field()
    status = Field()


class GithubHelpSpider(CrawlSpider):
    name = "github_help"
    allowed_domains = ["help.github.com"]
    start_urls = ["https://help.github.com", ]
    handle_httpstatus_list = [404]
    rules = (Rule(LinkExtractor(), callback='parse_item', follow=True),)

    def parse_item(self, response):
        if response.status == 404:
            item = GitHubLinkItem()
            item['url'] = response.url
            item['referer'] = response.request.headers.get('Referer')
            item['status'] = response.status

            return item

Just run scrapy runspider with -o output.json and you will see the list of collected items in the output.json file.


Scrapy ignores 404 responses by default and does not parse them. If you are getting a 404 error code in a response, you can handle it in a very easy way.

In settings.py, write:

HTTPERROR_ALLOWED_CODES = [404, 403]

And then handle the response status code in your parse function:

def parse(self, response):
    if response.status == 404:
        # your action on error, e.g. log or collect the failing URL
        self.logger.error("Page not found: %s", response.url)
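
If you prefer to keep this per spider rather than project-wide, the same setting can live in the spider's custom_settings; a minimal sketch (the spider name and URL are placeholders):

import scrapy


class ErrorAwareSpider(scrapy.Spider):
    name = "error_aware"
    start_urls = ["http://www.example.com/"]

    # Per-spider equivalent of the settings.py entry above.
    custom_settings = {
        "HTTPERROR_ALLOWED_CODES": [404, 403],
    }

    def parse(self, response):
        if response.status in (404, 403):
            # The error response now reaches parse(), so the failing URL
            # can be logged or collected here.
            self.logger.warning("Got %d for %s", response.status, response.url)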

The answers from @Talvalin and @alecxe helped me a great deal, but they do not seem to capture downloader events that do not generate a response object (for instance, twisted.internet.error.TimeoutError and twisted.web.http.PotentialDataLoss). These errors show up in the stats dump at the end of the run, but without any meta info.

As I found out here, the errors are tracked by the stats.py downloader middleware, captured in the DownloaderStats class' process_exception method, where the ex_class variable identifies each exception type so that its counter can be incremented; the counts are then dumped at the end of the run.

To match such errors with information from the corresponding request object, you can add a unique id to each request (via request.meta; I'll use an illustrative uniq_id key below), then pull it into the process_exception method of stats.py:

self.stats.set_value('downloader/my_errs/{0}'.format(request.meta['uniq_id']), ex_class)
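
In context, the whole custom middleware can be as small as the sketch below (the file name, class name, and uniq_id meta key are illustrative, not part of the original answer):

# my_stats.py - a minimal sketch of the altered middleware
from scrapy.downloadermiddlewares.stats import DownloaderStats


class MyDownloaderStats(DownloaderStats):
    def process_exception(self, request, exception, spider):
        ex_class = "%s.%s" % (exception.__class__.__module__,
                              exception.__class__.__name__)
        # Record which request failed, keyed by the unique id the spider
        # stored in request.meta ('uniq_id' is an assumed key name).
        self.stats.set_value(
            'downloader/my_errs/{0}'.format(request.meta.get('uniq_id', '?')),
            ex_class, spider=spider)
        # Keep the stock per-exception-type counters as well.
        return super().process_exception(request, exception, spider)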

That will generate a unique string for each downloader-based error not accompanied by a response. You can then save the altered stats.py as something else (e.g. my_stats.py), enable it in DOWNLOADER_MIDDLEWARES (with the right precedence), and disable the stock stats.py:

DOWNLOADER_MIDDLEWARES = {
    'myproject.my_stats.MyDownloaderStats': 850,
    'scrapy.downloadermiddlewares.stats.DownloaderStats': None,
}
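
On the spider side, the unique id might be attached when building each request; a minimal sketch (the spider name, URLs, and uniq_id key are placeholders, with ids in the group_id/member_id style used below):

import scrapy


class TaggedSpider(scrapy.Spider):
    name = "tagged"

    def start_requests(self):
        urls = ["http://www.example.com/a", "http://www.example.com/b"]
        for i, url in enumerate(urls):
            # '0/<i>' mimics the group_id/member_id ids in the stats below.
            yield scrapy.Request(url, meta={"uniq_id": "0/%d" % i},
                                 callback=self.parse)

    def parse(self, response):
        pass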

The output at the end of the run looks like this (here using meta info where each request URL is mapped to a group_id and member_id separated by a slash, like '0/14'):

{'downloader/exception_count': 3,
 'downloader/exception_type_count/twisted.web.http.PotentialDataLoss': 3,
 'downloader/my_errs/0/1': 'twisted.web.http.PotentialDataLoss',
 'downloader/my_errs/0/38': 'twisted.web.http.PotentialDataLoss',
 'downloader/my_errs/0/86': 'twisted.web.http.PotentialDataLoss',
 'downloader/request_bytes': 47583,
 'downloader/request_count': 133,
 'downloader/request_method_count/GET': 133,
 'downloader/response_bytes': 3416996,
 'downloader/response_count': 130,
 'downloader/response_status_count/200': 95,
 'downloader/response_status_count/301': 24,
 'downloader/response_status_count/302': 8,
 'downloader/response_status_count/500': 3,
 'finish_reason': 'finished'....}

This answer deals with non-downloader-based errors.
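
For completeness: if you want per-request handling of the Twisted failures mentioned above (rather than just counters), Scrapy also lets you attach an errback to each Request; the Failure it receives carries the original request, and therefore the failing URL. A minimal sketch following the pattern from the Scrapy docs (the spider name and URL are placeholders):

import scrapy
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError, TCPTimedOutError, TimeoutError


class ErrbackSpider(scrapy.Spider):
    name = "errback_example"
    start_urls = ["http://www.example.com/"]

    def start_requests(self):
        for url in self.start_urls:
            # The errback fires for any failure, including ones that never
            # produce a Response object (DNS errors, timeouts, ...).
            yield scrapy.Request(url, callback=self.parse, errback=self.on_error)

    def parse(self, response):
        pass

    def on_error(self, failure):
        if failure.check(HttpError):
            # A response was received, but with a non-2xx status code.
            response = failure.value.response
            self.logger.error("HttpError %s on %s", response.status, response.url)
        elif failure.check(DNSLookupError):
            # No response at all; the original request is on the failure.
            request = failure.request
            self.logger.error("DNSLookupError on %s", request.url)
        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.error("TimeoutError on %s", request.url)
        else:
            self.logger.error("Other failure: %s", repr(failure))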