How to give a URL to Scrapy for crawling?
Solution 1:
I'm not really sure about the command-line option. However, you could write your spider like this:
from scrapy.spider import BaseSpider


class MySpider(BaseSpider):
    name = 'my_spider'

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = [kwargs.get('start_url')]
And start it like this:
scrapy crawl my_spider -a start_url="http://some_url"
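For reference, here is a minimal, self-contained sketch of the same idea on a current Scrapy release, where BaseSpider has since been renamed scrapy.Spider; the parse callback is only a placeholder so the spider does something when run:

import scrapy


class MySpider(scrapy.Spider):
    name = 'my_spider'

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # The -a start_url=... value arrives here as a plain string keyword argument.
        self.start_urls = [kwargs.get('start_url')]

    def parse(self, response):
        # Placeholder callback: just record which URL was fetched.
        yield {'url': response.url}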
Solution 2:
An even easier way to allow multiple URL arguments than what Peter suggested is to pass them as a single string, with the URLs separated by commas, like this:
-a start_urls="http://example1.com,http://example2.com"
In the spider you would then simply split the string on ',' to get a list of URLs:
self.start_urls = kwargs.get('start_urls').split(',')
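Assuming the spider is named my_spider as in the first answer, the crawl would then be started with:

scrapy crawl my_spider -a start_urls="http://example1.com,http://example2.com"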
Solution 3:
Use the scrapy parse command. It parses a URL with your spider; the URL is passed on the command line:
$ scrapy parse http://www.example.com/ --spider=spider-name
http://doc.scrapy.org/en/latest/topics/commands.html#parse
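The same documentation page also describes a -c/--callback option for choosing which spider method handles the fetched response, for example:

$ scrapy parse http://www.example.com/ --spider=spider-name -c parse_item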
Solution 4:
Sjaak Trekhaak has the right idea; here is how to allow multiple URLs:
import scrapy


class MySpider(scrapy.Spider):
    """
    This spider will try to crawl whatever is passed in `start_urls`, which
    should be a comma-separated string of fully qualified URIs.
    Example: start_urls=http://localhost,http://example.com
    """

    def __init__(self, name=None, **kwargs):
        if 'start_urls' in kwargs:
            self.start_urls = kwargs.pop('start_urls').split(',')
        super(MySpider, self).__init__(name, **kwargs)
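Assuming the spider is registered in the project under a name such as my_spider (the name attribute is not shown in the snippet above), the crawl is then started the same way as in the earlier answers:

scrapy crawl my_spider -a start_urls="http://localhost,http://example.com"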