How can I scrape an HTML table to CSV?

Solution 1:

  • Select the HTML table in your tool's UI and copy it to the clipboard (if that's possible).
  • Paste it into Excel.
  • Save as a CSV file.

However, this is a manual solution, not an automated one.

Solution 2:

Using Python:

For example, imagine you want to scrape forex quotes in CSV form from a site such as fxquotes. Then:

# Python 2 / BeautifulSoup 3 era code
from BeautifulSoup import BeautifulSoup
import urllib

# Build the query URL: date range, currency pair, and CSV output format
date_s = '&date1=01/01/08'
date_f = '&date=11/10/08'
fx_url = 'http://www.oanda.com/convert/fxhistory?date_fmt=us'
fx_url_end = '&lang=en&margin_fixed=0&format=CSV&redirected=1'
cur1, cur2 = 'USD', 'AUD'
fx_url = fx_url + date_f + date_s + '&exch=' + cur1 + '&exch2=' + cur1
fx_url = fx_url + '&expr=' + cur2 + '&expr2=' + cur2 + fx_url_end

# Fetch the page and pull the CSV text out of the single <pre> block
data = urllib.urlopen(fx_url).read()
soup = BeautifulSoup(data)
data = str(soup.findAll('pre', limit=1))
data = data.replace('[<pre>', '')   # strip the list/tag wrapping left by str()
data = data.replace('</pre>]', '')

# Write the result to disk
file_location = '/Users/location_edit_this'
file_name = file_location + 'usd_aus.csv'
f = open(file_name, 'w')
f.write(data)
f.close()

Edit: to get values from a table, here is an example adapted from palewire:

# Python 2: mechanize + BeautifulSoup 3
from mechanize import Browser
from BeautifulSoup import BeautifulSoup

mech = Browser()

url = "http://www.palewire.com/scrape/albums/2007.html"
page = mech.open(url)

html = page.read()
soup = BeautifulSoup(html)

# The target table is the one with border=1
table = soup.find("table", border=1)

# Skip the header row, then pull out the cells of interest
for row in table.findAll('tr')[1:]:
    col = row.findAll('td')

    rank = col[0].string
    artist = col[1].string
    album = col[2].string
    cover_link = col[3].img['src']

    record = (rank, artist, album, cover_link)
    print "|".join(record)
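The loop above prints pipe-delimited records; since the question asks for CSV, the same tuples can instead be fed to Python's csv module, which handles quoting and escaping automatically. A minimal sketch (the row values here are made up for illustration):

```python
import csv
import io

# Hypothetical (rank, artist, album, cover_link) tuples, standing in for
# the records produced by the table loop above
rows = [
    ("1", "Radiohead", "In Rainbows", "covers/in_rainbows.jpg"),
    ("2", "LCD Soundsystem", "Sound of Silver", "covers/sound_of_silver.jpg"),
]

buf = io.StringIO()
writer = csv.writer(buf)  # csv.writer takes care of quoting/escaping
writer.writerow(["rank", "artist", "album", "cover_link"])  # header row
writer.writerows(rows)
print(buf.getvalue(), end="")
```

In the real script, `buf` would simply be the output file opened for writing (or `sys.stdout`).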

Solution 3:

This is my Python version using the (currently) latest version of BeautifulSoup, which can be obtained with, e.g.,

$ sudo easy_install beautifulsoup4

The script reads HTML from standard input and outputs the text found in all tables in proper CSV format.

#!/usr/bin/env python
from bs4 import BeautifulSoup
import sys
import re
import csv

def cell_text(cell):
    # Join all text fragments in a cell, collapsing whitespace
    return " ".join(cell.stripped_strings)

soup = BeautifulSoup(sys.stdin.read(), "html.parser")
output = csv.writer(sys.stdout)

for table in soup.find_all('table'):
    for row in table.find_all('tr'):
        # Match both header (th) and data (td) cells
        col = map(cell_text, row.find_all(re.compile('t[dh]')))
        output.writerow(col)
    output.writerow([])  # blank row between tables

Solution 4:

Even easier (because Excel saves the query for you for next time) ...

In Excel

Data/Import External Data/New Web Query

will take you to a URL prompt. Enter your URL, and Excel will outline the tables available on the page for import. Voilà.

Solution 5:

Two ways come to mind (especially for those of us who don't have Excel):

  • Google Spreadsheets has an excellent IMPORTHTML function:
    • =IMPORTHTML("http://example.com/page/with/table", "table", index)
    • The index starts at 1.
    • I recommend copying and pasting as values shortly after importing.
    • File -> Download as -> CSV
  • Python's superb pandas library has handy read_html and to_csv functions.
    • Here's a basic Python 3 script that prompts for the URL, which table at that URL, and a filename for the CSV.
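The script itself isn't included above; a minimal sketch of the pandas approach it describes might look like the following (the function name and parameters are hypothetical):

```python
import pandas as pd

def table_to_csv(source, table_index, csv_path):
    """Save one HTML table from `source` (a URL, file path, or file-like
    object) as CSV.

    pd.read_html returns a list of DataFrames, one per <table> element
    found, so `table_index` (0-based) picks which table to keep.
    """
    tables = pd.read_html(source)
    tables[table_index].to_csv(csv_path, index=False)
```

Wrapping it in the prompts the answer describes could be as simple as:
`table_to_csv(input("URL: "), int(input("Table index (0-based): ")), input("CSV filename: "))`.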