Does anyone know of a good Python-based web crawler that I could use?

I'm half-tempted to write my own, but I don't really have enough time right now. I've seen the Wikipedia list of open source crawlers but I'd prefer something written in Python. I realize that I could probably just use one of the tools on the Wikipedia page and wrap it in Python. I might end up doing that - if anyone has any advice about any of those tools, I'm open to hearing about them. I've used Heritrix via its web interface and I found it to be quite cumbersome. I definitely won't be using a browser API for my upcoming project.

Thanks in advance. Also, this is my first SO question!

Solution 1:

  • Mechanize is my favorite; great high-level browsing capabilities (super-simple form filling and submission).
  • Twill is a simple scripting language built on top of Mechanize.
  • BeautifulSoup + urllib2 also works quite nicely.
  • Scrapy looks like an extremely promising project; it's new.
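
All of the tools above build on the same basic step: parse a page and collect its links. In case it helps, here is a minimal sketch of that step using only the standard library (Python 3). It is not Mechanize or BeautifulSoup code; `LinkCollector`, `extract_links`, and the sample HTML are names I made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the URL of the page itself.
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return every link found in html, in document order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page content; a real crawler would download it first.
SAMPLE_HTML = '<p><a href="/about">About</a> <a href="http://example.org/x">X</a></p>'
links = extract_links(SAMPLE_HTML, "http://example.com/index.html")
print(links)
```

A real crawler would pass the downloaded page body to `extract_links` instead of the sample string.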

Solution 2:

Use Scrapy.

It is a Twisted-based web crawler framework. It is still under heavy development, but it already works, and it has many goodies:

  • Built-in support for parsing HTML, XML, CSV, and JavaScript
  • A media pipeline for scraping items with images (or any other media) and downloading the image files as well
  • Support for extending Scrapy by plugging in your own functionality using middlewares, extensions, and pipelines
  • Wide range of built-in middlewares and extensions for handling compression, cache, cookies, authentication, user-agent spoofing, robots.txt, statistics, crawl depth restriction, etc.
  • Interactive scraping shell console, very useful for developing and debugging
  • Web management console for monitoring and controlling your bot
  • Telnet console for low-level access to the Scrapy process
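
To give an idea of what one of these goodies, robots.txt handling, amounts to, here is a rough sketch using the standard library's `urllib.robotparser` (Python 3). This is not Scrapy's middleware, just the same idea in miniature; the robots.txt content, the `example-bot` user agent, and the helper names are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a crawler would normally download
# this from http://<site>/robots.txt before fetching anything else.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_robots_checker(robots_txt):
    """Parse robots.txt text and return the configured parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


robots = build_robots_checker(ROBOTS_TXT)


def allowed(url, user_agent="example-bot"):
    """True if the parsed rules let user_agent fetch url."""
    return robots.can_fetch(user_agent, url)


print(allowed("http://example.com/index.html"))
print(allowed("http://example.com/private/data.html"))
```

A crawler would call `allowed(url)` before each download and skip any URL for which it returns False.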

Example code to extract information about all torrent files added today on the mininova torrent site, using an XPath selector on the returned HTML:

class Torrent(ScrapedItem):
    pass

class MininovaSpider(CrawlSpider):
    domain_name = ''
    start_urls = ['']
    rules = [Rule(RegexLinkExtractor(allow=['/tor/\d+']), 'parse_torrent')]

    def parse_torrent(self, response):
        x = HtmlXPathSelector(response)
        torrent = Torrent()

        torrent.url = response.url
        torrent.name = x.x("//h1/text()").extract()
        torrent.description = x.x("//div[@id='description']").extract()
        torrent.size = x.x("//div[@id='info-left']/p[2]/text()[2]").extract()
        return [torrent]
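
If you want to try the same kind of XPath lookups without installing anything, the standard library's `xml.etree.ElementTree` supports a small XPath subset. The sketch below (Python 3) is not Scrapy's selector API, and it only works on well-formed markup, so real-world HTML would need a more tolerant parser. The sample page, the dictionary keys, and the example URL are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment shaped like the XPaths above expect.
SAMPLE_PAGE = """\
<html><body>
  <h1>Example torrent</h1>
  <div id="description">A short description.</div>
  <div id="info-left"><p>Category: demo</p><p>Size: 12 MB</p></div>
</body></html>"""


def parse_torrent(page_source, url):
    """Extract a few fields from the page with ElementTree's XPath subset."""
    root = ET.fromstring(page_source)
    return {
        "url": url,
        "heading": root.findtext(".//h1"),
        "description": root.findtext(".//div[@id='description']"),
        # p[2] selects the second <p> child of the info-left div.
        "size": root.findtext(".//div[@id='info-left']/p[2]"),
    }


item = parse_torrent(SAMPLE_PAGE, "http://example.com/tor/1")
print(item)
```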

Solution 3:

Check out HarvestMan, a multi-threaded web crawler written in Python. Also take a look at the module.

And here you can find code samples to build a simple web-crawler.
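
For what it's worth, the core of a simple crawler is just a queue of pages to visit plus a set of pages already seen. Here is a minimal sketch of that loop (Python 3). It is not taken from HarvestMan or from the linked samples; `crawl`, `get_links`, and the in-memory `FAKE_SITE` are hypothetical names, and the link lookup is passed in as a function so the traversal can be tried without any network access.

```python
from collections import deque


def crawl(start_url, get_links, max_pages=100):
    """Breadth-first traversal; returns URLs in the order they were visited."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            # Queue each URL at most once, even if many pages link to it.
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical in-memory "site": page -> links found on that page.
FAKE_SITE = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/"],
    "/c": [],
}

order = crawl("/", lambda url: FAKE_SITE.get(url, []))
print(order)
```

In a real crawler, `get_links` would download the page and parse its links, and `max_pages` keeps the crawl from running away on a large site.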

Solution 4:

I've used Ruya and found it pretty good.

Solution 5:

I hacked the above script to include a login page, as I needed it to access a Drupal site. It's not pretty, but it may help someone out there.


import httplib2
import urllib
import urllib2
from cookielib import CookieJar
import sys
import re
from HTMLParser import HTMLParser

class miniHTMLParser( HTMLParser ):

  viewedQueue = []
  instQueue = []
  headers = {}
  opener = ""

  def get_next_link( self ):
    if self.instQueue == []:
      return ''
    return self.instQueue.pop(0)

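  def gethtmlfile( self, site, page ):
    # Fetch the page through the cookie-aware opener set up by loginSite().
    # The call on the truncated line is assumed to be self.opener.open(url).
    try:
      url = 'http://'+site+page
      response = self.opener.open(url)
      return response.read()
    except Exception, err:
      print " Error retrieving: "+page
      sys.stderr.write('ERROR: %s\n' % str(err))
    # Return an empty page on failure
    return ""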

  def loginSite( self, site_url ):
    try:
      # The opener keeps the session cookie for all later requests
      cj = CookieJar()
      self.opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

      url = 'http://'+site_url
      # Site-specific Drupal login form fields; replace with your own values
      params = {'name': 'customer_admin', 'pass': 'customer_admin123', 'opt': 'Log in', 'form_build_id': 'form-3560fb42948a06b01d063de48aa216ab', 'form_id':'user_login_block'}
      user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
      self.headers = { 'User-Agent' : user_agent }

      data = urllib.urlencode(params)
      # POST the form; the truncated call is assumed to be self.opener.open(url, data)
      response = self.opener.open(url, data)
      print "Logged in"

    except Exception, err:
      print " Error logging in"
      sys.stderr.write('ERROR: %s\n' % str(err))

    return 1

  def handle_starttag( self, tag, attrs ):
    if tag == 'a' and attrs:
      # Assumes href is the first attribute of the tag
      newstr = str(attrs[0][1])
      print newstr
      # Only queue site-relative links that have not been seen yet
      if re.search('http', newstr) is None:
        if re.search('mailto', newstr) is None:
          if re.search('#', newstr) is None:
            if newstr not in self.viewedQueue:
              print "  adding", newstr
              self.instQueue.append( newstr )
              self.viewedQueue.append( newstr )
          else:
            print "  ignoring", newstr
        else:
          print "  ignoring", newstr
      else:
        print "  ignoring", newstr

def main():

  if len(sys.argv)!=3:
    print "usage is %s site link" % sys.argv[0]
    sys.exit(2)

  mySpider = miniHTMLParser()

  site = sys.argv[1]
  link = sys.argv[2]

  url_login_link = site+"/node?destination=node"
  print "\nLogging in", url_login_link
  x = mySpider.loginSite( url_login_link )

  while link != '':

    print "\nChecking link ", link

    # Get the file from the site and link
    retfile = mySpider.gethtmlfile( site, link )

    # Feed the file into the HTML parser
    mySpider.feed(retfile)

    # Search the retfile here

    # Get the next link in level traversal order
    link = mySpider.get_next_link()


  print "\ndone\n"

if __name__ == "__main__":
  main()
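
The script above is Python 2 (`urllib2`, `cookielib`). For anyone on Python 3, the cookie-based login part could look roughly like the sketch below. This is an untested-against-Drupal port, not a drop-in replacement: `make_session`, `build_login_request`, the user agent string, and the example URL and credentials are made up, and a real Drupal form will likely need its own `form_build_id` as in the script above.

```python
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, Request, build_opener


def make_session(user_agent="example-crawler/0.1"):
    """Return (opener, cookie_jar); the opener stores and resends cookies."""
    jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))
    # Headers listed here are sent with every request made via the opener.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


def build_login_request(login_url, username, password):
    """Build the POST request for a Drupal-style login form (fields assumed)."""
    form = {"name": username, "pass": password, "form_id": "user_login_block"}
    # Supplying data makes urllib send a POST instead of a GET.
    return Request(login_url, data=urlencode(form).encode("utf-8"))


opener, jar = make_session()
request = build_login_request(
    "http://example.com/node?destination=node", "user", "secret"
)
# opener.open(request) would submit the form; later opener.open(url) calls
# then reuse whatever session cookie the site set in the jar.
print(request.get_method(), request.full_url)
```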