Anyone know of a good Python based web crawler that I could use?
I'm half-tempted to write my own, but I don't really have enough time right now. I've seen the Wikipedia list of open source crawlers but I'd prefer something written in Python. I realize that I could probably just use one of the tools on the Wikipedia page and wrap it in Python. I might end up doing that - if anyone has any advice about any of those tools, I'm open to hearing about them. I've used Heritrix via its web interface and I found it to be quite cumbersome. I definitely won't be using a browser API for my upcoming project.
Thanks in advance. Also, this is my first SO question!
Solution 1:
- Mechanize is my favorite; it offers great high-level browsing capabilities (super-simple form filling and submission). There's a short sketch of this just after the list.
- Twill is a simple scripting language built on top of Mechanize.
- BeautifulSoup + urllib2 also works quite nicely.
- Scrapy looks like an extremely promising project; it's new.
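Here is that Mechanize sketch: a minimal, hypothetical example (Python 2; the URL and form field names are made up, so adapt them to the page you are actually automating):

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)              # demo only; honour robots.txt in real crawls
br.open('http://www.example.com/login')  # hypothetical login page

br.select_form(nr=0)                     # pick the first form on the page
br['username'] = 'me'                    # field names are assumptions
br['password'] = 'secret'
br.submit()

# The Browser keeps the session cookies, so you can just keep browsing:
for link in br.links():
    print link.absolute_url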
Solution 2:
Use Scrapy.
It is a Twisted-based web crawler framework, still under heavy development, but it already works. It has many goodies:
- Built-in support for parsing HTML, XML, CSV, and JavaScript
- A media pipeline for scraping items with images (or any other media) and downloading the media files as well
- Support for extending Scrapy by plugging in your own functionality using middlewares, extensions, and pipelines
- A wide range of built-in middlewares and extensions for handling compression, caching, cookies, authentication, user-agent spoofing, robots.txt, statistics, crawl depth restriction, etc.
- An interactive scraping shell console, very useful for developing and debugging
- A web management console for monitoring and controlling your bot
- A telnet console for low-level access to the Scrapy process
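For example, the shell can be pointed at a live page so you can try XPath expressions interactively before putting them in a spider; in current Scrapy releases that is "scrapy shell <url>" (the early releases exposed the same shell through a scrapy-ctl.py control script, if I remember the old interface correctly).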
Example code to extract information about all torrent files added today on the mininova torrent site, using an XPath selector on the returned HTML:
# NOTE: this targets the early (pre-1.0) Scrapy API; the import paths below are a
# best guess for that era, and recent Scrapy releases have since renamed or
# replaced ScrapedItem, RegexLinkExtractor and HtmlXPathSelector.
from scrapy.item import ScrapedItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.link.extractors import RegexLinkExtractor
from scrapy.selector import HtmlXPathSelector

class Torrent(ScrapedItem):
    pass

class MininovaSpider(CrawlSpider):
    domain_name = 'mininova.org'
    start_urls = ['http://www.mininova.org/today']
    rules = [Rule(RegexLinkExtractor(allow=['/tor/\d+']), 'parse_torrent')]

    def parse_torrent(self, response):
        x = HtmlXPathSelector(response)
        torrent = Torrent()
        torrent.url = response.url
        torrent.name = x.x("//h1/text()").extract()
        torrent.description = x.x("//div[@id='description']").extract()
        torrent.size = x.x("//div[@id='info-left']/p[2]/text()[2]").extract()
        return [torrent]
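To run the spider, the early Scrapy releases shipped a scrapy-ctl.py control script (something like "python scrapy-ctl.py crawl mininova.org"); in current releases the equivalent is the "scrapy crawl" command. Either way, the Torrent items the spider returns are what get handed to your item pipeline.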
Solution 3:
Check out HarvestMan, a multi-threaded web crawler written in Python; also take a look at the spider.py module.
And here you can find code samples to build a simple web crawler.
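Along the same lines, here is a minimal sketch of a simple breadth-first crawler (Python 2; the start URL is a placeholder, and there is no throttling or robots.txt handling, so treat it as a teaching example rather than a production crawler):

import urllib2
import urlparse
from HTMLParser import HTMLParser

class LinkParser(HTMLParser):
    """Collects the href attribute of every <a> tag it sees."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.links.append(href)

def crawl(start_url, max_pages=50):
    host = urlparse.urlparse(start_url).netloc
    seen = set([start_url])
    queue = [start_url]
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.pop(0)       # FIFO queue gives breadth-first order
        print 'fetching', url
        try:
            html = urllib2.urlopen(url).read()
            parser = LinkParser()
            parser.feed(html)
        except Exception, err:
            print ' error:', err
            continue
        fetched += 1
        for link in parser.links:
            absolute = urlparse.urljoin(url, link)
            # stay on the starting host and skip anything already queued
            if urlparse.urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)

crawl('http://www.example.com/')   # placeholder start URL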
Solution 4:
I've used Ruya and found it pretty good.
Solution 5:
I hacked the above script to include a login page, as I needed it to access a Drupal site. It's not pretty, but it may help someone out there.
#!/usr/bin/python
import urllib
import urllib2
from cookielib import CookieJar
import sys
import re
from HTMLParser import HTMLParser

class miniHTMLParser(HTMLParser):

    viewedQueue = []
    instQueue = []
    headers = {}
    opener = None

    def get_next_link(self):
        if self.instQueue == []:
            return ''
        else:
            return self.instQueue.pop(0)

    def gethtmlfile(self, site, page):
        try:
            url = 'http://' + site + page
            response = self.opener.open(url)
            return response.read()
        except Exception, err:
            print " Error retrieving: " + page
            sys.stderr.write('ERROR: %s\n' % str(err))
            return ""

    def loginSite(self, site_url):
        try:
            # Keep the session cookie Drupal sets at login in a cookie jar
            cj = CookieJar()
            self.opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
            url = 'http://' + site_url
            # These form values are specific to the target Drupal site
            params = {'name': 'customer_admin',
                      'pass': 'customer_admin123',
                      'opt': 'Log in',
                      'form_build_id': 'form-3560fb42948a06b01d063de48aa216ab',
                      'form_id': 'user_login_block'}
            user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
            self.headers = {'User-Agent': user_agent}
            data = urllib.urlencode(params)
            # Send the headers along with the POST data so the user agent is used
            request = urllib2.Request(url, data, self.headers)
            response = self.opener.open(request)
            print "Logged in"
            return response.read()
        except Exception, err:
            print " Error logging in"
            sys.stderr.write('ERROR: %s\n' % str(err))
            return 1

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # Look up the href attribute instead of assuming it comes first
            newstr = str(dict(attrs).get('href', ''))
            print newstr
            # Queue only relative links we have not seen before
            if (newstr != ''
                    and re.search('http', newstr) is None
                    and re.search('mailto', newstr) is None
                    and re.search('#', newstr) is None
                    and newstr not in self.viewedQueue):
                print " adding", newstr
                self.instQueue.append(newstr)
                self.viewedQueue.append(newstr)
            else:
                print " ignoring", newstr

def main():
    if len(sys.argv) != 3:
        print "usage is ./minispider.py site link"
        sys.exit(2)

    mySpider = miniHTMLParser()

    site = sys.argv[1]
    link = sys.argv[2]

    url_login_link = site + "/node?destination=node"
    print "\nLogging in", url_login_link
    mySpider.loginSite(url_login_link)

    while link != '':
        print "\nChecking link ", link

        # Get the file from the site and link
        retfile = mySpider.gethtmlfile(site, link)

        # Feed the file into the HTML parser
        mySpider.feed(retfile)

        # Search the retfile here

        # Get the next link in level (breadth-first) traversal order
        link = mySpider.get_next_link()

    mySpider.close()
    print "\ndone\n"

if __name__ == "__main__":
    main()
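To run it, pass the host and the starting path as the two arguments, e.g. (hypothetical values):

python minispider.py www.example.com /index.html

The script logs in first, then walks the site breadth-first from the starting link.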