How can I convert an HTML table to CSV?
Solution 1:
This method is not really a library OR a program, but for ad hoc conversions you can
- put the HTML for a table in a text file called something.xls
- open it with a spreadsheet
- save it as CSV.
I know this works with Excel, and I believe I've done it with the OpenOffice spreadsheet.
But you probably would prefer a Perl or Ruby script...
Solution 2:
Sorry for resurrecting an ancient thread, but I recently wanted to do this, but I wanted a 100% portable bash script to do it. So here's my solution using only grep and sed.
The below was bashed out very quickly, and so could be made much more elegant, but I'm just getting started really with sed/awk etc...
curl "http://www.webpagewithtableinit.com/" 2>/dev/null | grep -i -e '</\?TABLE\|</\?TD\|</\?TR\|</\?TH' | sed 's/^[\ \t]*//g' | tr -d '\n' | sed 's/<\/TR[^>]*>/\n/Ig' | sed 's/<\/\?\(TABLE\|TR\)[^>]*>//Ig' | sed 's/^<T[DH][^>]*>\|<\/\?T[DH][^>]*>$//Ig' | sed 's/<\/T[DH][^>]*><T[DH][^>]*>/,/Ig'
As you can see I've got the page source using curl, but you could just as easily feed in the table source from elsewhere.
Here's the explanation:
Get the Contents of the URL using cURL, dump stderr to null (no progress meter)
curl "http://www.webpagewithtableinit.com/" 2>/dev/null
.
I only want Table elements (return only lines with TABLE,TR,TH,TD tags)
| grep -i -e '</\?TABLE\|</\?TD\|</\?TR\|</\?TH'
.
Remove any Whitespace at the beginning of the line.
| sed 's/^[\ \t]*//g'
.
Remove newlines
| tr -d '\n\r'
.
Replace </TR>
with newline
| sed 's/<\/TR[^>]*>/\n/Ig'
.
Remove TABLE and TR tags
| sed 's/<\/\?\(TABLE\|TR\)[^>]*>//Ig'
.
Remove ^<TD>
, ^<TH>
, </TD>$
, </TH>$
| sed 's/^<T[DH][^>]*>\|<\/\?T[DH][^>]*>$//Ig'
.
Replace </TD><TD>
with comma
| sed 's/<\/T[DH][^>]*><T[DH][^>]*>/,/Ig'
.
Note that if any of the table cells contain commas, you may need to escape them first, or use a different delimiter.
Hope this helps someone!
Solution 3:
Here's a ruby script that uses nokogiri -- http://nokogiri.rubyforge.org/nokogiri/
require 'nokogiri'
doc = Nokogiri::HTML(table_string)
doc.xpath('//table//tr').each do |row|
row.xpath('td').each do |cell|
print '"', cell.text.gsub("\n", ' ').gsub('"', '\"').gsub(/(\s){2,}/m, '\1'), "\", "
end
print "\n"
end
Worked for my basic test case.
Solution 4:
Here's a short Python program I wrote to complete this task. It was written in a couple of minutes, so it can probably be made better. Not sure how it'll handle nested tables (probably it'll do bad stuff) or multiple tables (probably they'll just appear one after another). It doesn't handle colspan
or rowspan
.
Enjoy.
from HTMLParser import HTMLParser
import sys
import re
class HTMLTableParser(HTMLParser):
def __init__(self, row_delim="\n", cell_delim="\t"):
HTMLParser.__init__(self)
self.despace_re = re.compile(r'\s+')
self.data_interrupt = False
self.first_row = True
self.first_cell = True
self.in_cell = False
self.row_delim = row_delim
self.cell_delim = cell_delim
def handle_starttag(self, tag, attrs):
self.data_interrupt = True
if tag == "table":
self.first_row = True
self.first_cell = True
elif tag == "tr":
if not self.first_row:
sys.stdout.write(self.row_delim)
self.first_row = False
self.first_cell = True
self.data_interrupt = False
elif tag == "td" or tag == "th":
if not self.first_cell:
sys.stdout.write(self.cell_delim)
self.first_cell = False
self.data_interrupt = False
self.in_cell = True
def handle_endtag(self, tag):
self.data_interrupt = True
if tag == "td" or tag == "th":
self.in_cell = False
def handle_data(self, data):
if self.in_cell:
#if self.data_interrupt:
# sys.stdout.write(" ")
sys.stdout.write(self.despace_re.sub(' ', data).strip())
self.data_interrupt = False
parser = HTMLTableParser()
parser.feed(sys.stdin.read())