Web scraping with Python [closed]
Use urllib2 in combination with the brilliant BeautifulSoup library:
import urllib2
from BeautifulSoup import BeautifulSoup
# or if you're using BeautifulSoup4:
# from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://example.com').read())
for row in soup('table', {'class': 'spad'})[0].tbody('tr'):
    tds = row('td')
    print tds[0].string, tds[1].string
# will print date and sunrise
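The snippet above is Python 2. For Python 3, a rough equivalent might look like the sketch below. It assumes BeautifulSoup4 is installed and keeps the `spad` table class from the example; the function names `extract_rows` and `fetch_rows` are made up for illustration. Parsing is split from downloading so the parsing part can be tried on a plain HTML string.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4


def extract_rows(html, table_class="spad"):
    """Return (first_cell, second_cell) text pairs from the first matching table.

    `extract_rows` is a hypothetical helper name, not part of any library.
    """
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_=table_class)
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        tds = tr.find_all("td")
        if len(tds) >= 2:  # skip header rows and rows with too few cells
            rows.append((tds[0].get_text(strip=True),
                         tds[1].get_text(strip=True)))
    return rows


def fetch_rows(url):
    """Download a page and extract its rows (performs a network request)."""
    with urlopen(url) as response:
        return extract_rows(response.read())
```

Calling `fetch_rows('http://example.com')` and looping over the result would then give the date and sunrise pairs, provided the page really contains such a table.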
I'd really recommend Scrapy.
Quote from a deleted answer:
- Scrapy crawls faster than mechanize because it uses asynchronous operations (built on Twisted).
- Scrapy has better and faster support for parsing (X)HTML, built on libxml2.
- Scrapy is a mature framework with full Unicode support; it handles redirections, gzipped responses, and odd encodings, and has an integrated HTTP cache, etc.
- Once you are into Scrapy, you can write a spider in less than 5 minutes that downloads images, creates thumbnails, and exports the extracted data directly to CSV or JSON.