How can I use different pipelines for different spiders in a single Scrapy project
Solution 1:
Just remove all pipelines from main settings and use this inside spider.
This will define the pipeline to user per spider
class testSpider(InitSpider):
name = 'test'
custom_settings = {
'ITEM_PIPELINES': {
'app.MyPipeline': 400
}
}
Solution 2:
Building on the solution from Pablo Hoffman, you can use the following decorator on the process_item
method of a Pipeline object so that it checks the pipeline
attribute of your spider for whether or not it should be executed. For example:
def check_spider_pipeline(process_item_method):
@functools.wraps(process_item_method)
def wrapper(self, item, spider):
# message template for debugging
msg = '%%s %s pipeline step' % (self.__class__.__name__,)
# if class is in the spider's pipeline, then use the
# process_item method normally.
if self.__class__ in spider.pipeline:
spider.log(msg % 'executing', level=log.DEBUG)
return process_item_method(self, item, spider)
# otherwise, just return the untouched item (skip this step in
# the pipeline)
else:
spider.log(msg % 'skipping', level=log.DEBUG)
return item
return wrapper
For this decorator to work correctly, the spider must have a pipeline attribute with a container of the Pipeline objects that you want to use to process the item, for example:
class MySpider(BaseSpider):
pipeline = set([
pipelines.Save,
pipelines.Validate,
])
def parse(self, response):
# insert scrapy goodness here
return item
And then in a pipelines.py
file:
class Save(object):
@check_spider_pipeline
def process_item(self, item, spider):
# do saving here
return item
class Validate(object):
@check_spider_pipeline
def process_item(self, item, spider):
# do validating here
return item
All Pipeline objects should still be defined in ITEM_PIPELINES in settings (in the correct order -- would be nice to change so that the order could be specified on the Spider, too).
Solution 3:
You can use the name
attribute of the spider in your pipeline
class CustomPipeline(object)
def process_item(self, item, spider)
if spider.name == 'spider1':
# do something
return item
return item
Defining all pipelines this way can accomplish what you want.