Extracting data from HTML table
Solution 1:
A Python solution using BeautifulSoup4. It skips the header row and selects the table by class="details":
from bs4 import BeautifulSoup
html = """
<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
<tr valign="top">
<th>Tests</th>
<th>Failures</th>
<th>Success Rate</th>
<th>Average Time</th>
<th>Min Time</th>
<th>Max Time</th>
</tr>
<tr valign="top" class="Failure">
<td>103</td>
<td>24</td>
<td>76.70%</td>
<td>71 ms</td>
<td>0 ms</td>
<td>829 ms</td>
</tr>
</table>"""
# Name the parser explicitly so bs4 does not have to guess one (and warn about it).
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", attrs={"class":"details"})
# The first tr contains the field names.
headings = [th.get_text() for th in table.find("tr").find_all("th")]
datasets = []
for row in table.find_all("tr")[1:]:
    dataset = zip(headings, (td.get_text() for td in row.find_all("td")))
    datasets.append(dataset)
print datasets
The result looks like this:
[[(u'Tests', u'103'),
  (u'Failures', u'24'),
  (u'Success Rate', u'76.70%'),
  (u'Average Time', u'71 ms'),
  (u'Min Time', u'0 ms'),
  (u'Max Time', u'829 ms')]]
Edit2: To produce the desired output, use something like this:
for dataset in datasets:
    for field in dataset:
        print "{0:<16}: {1}".format(field[0], field[1])
Result:
Tests           : 103
Failures        : 24
Success Rate    : 76.70%
Average Time    : 71 ms
Min Time        : 0 ms
Max Time        : 829 ms
Solution 2:
Use pandas.read_html:
import pandas as pd
html_tables = pd.read_html('resources/test.html')
df = html_tables[0]
df.T # transpose to align
                   0
Tests            103
Failures          24
Success Rate  76.70%
Average Time   71 ms
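If the HTML is already in a string rather than a file, pandas.read_html can read it from a file-like object, and its attrs argument can restrict the match to the class="details" table. A minimal sketch, assuming pandas is installed together with a parser backend it supports (lxml, or BeautifulSoup4 with html5lib); the variable names are illustrative:

```python
from io import StringIO

import pandas as pd

html = """
<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
<tr valign="top">
<th>Tests</th>
<th>Failures</th>
<th>Success Rate</th>
<th>Average Time</th>
<th>Min Time</th>
<th>Max Time</th>
</tr>
<tr valign="top" class="Failure">
<td>103</td>
<td>24</td>
<td>76.70%</td>
<td>71 ms</td>
<td>0 ms</td>
<td>829 ms</td>
</tr>
</table>"""

# Wrap the string in StringIO so read_html treats it as a file-like object.
# attrs narrows the search to tables carrying class="details".
tables = pd.read_html(StringIO(html), attrs={"class": "details"})
df = tables[0]

# Transpose so each field name becomes a row label.
print(df.T)
```

Note that pandas infers column types, so a cell such as 103 comes back as a number while "76.70%" and "71 ms" stay strings.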
Solution 3:
Here is the top answer, adapted for Python3 compatibility, and improved by stripping whitespace in cells:
from bs4 import BeautifulSoup
html = """
<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
<tr valign="top">
<th>Tests</th>
<th>Failures</th>
<th>Success Rate</th>
<th>Average Time</th>
<th>Min Time</th>
<th>Max Time</th>
</tr>
<tr valign="top" class="Failure">
<td>103</td>
<td>24</td>
<td>76.70%</td>
<td>71 ms</td>
<td>0 ms</td>
<td>829 ms</td>
</tr>
</table>"""
soup = BeautifulSoup(html, 'html.parser')
table = soup.find("table")
# The first tr contains the field names.
headings = [th.get_text().strip() for th in table.find("tr").find_all("th")]
print(headings)
datasets = []
for row in table.find_all("tr")[1:]:
    # Strip whitespace from each cell, as is done for the headings.
    dataset = dict(zip(headings, (td.get_text().strip() for td in row.find_all("td"))))
    datasets.append(dataset)
print(datasets)
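The dictionaries built above can also be printed in the aligned "name : value" layout in Python 3. A minimal sketch, assuming datasets has the shape produced by that loop (one dict per table row); format_dataset is a hypothetical helper name and not part of BeautifulSoup:

```python
# Sample data in the shape produced by the loop above (one dict per table row).
datasets = [{
    "Tests": "103",
    "Failures": "24",
    "Success Rate": "76.70%",
    "Average Time": "71 ms",
    "Min Time": "0 ms",
    "Max Time": "829 ms",
}]


def format_dataset(dataset, width=16):
    """Return one 'name : value' line per field, names left-aligned to `width`."""
    return ["{0:<{1}}: {2}".format(name, width, value)
            for name, value in dataset.items()]


for dataset in datasets:
    print("\n".join(format_dataset(dataset)))
```

Dictionaries keep insertion order in Python 3.7 and later, so the fields print in the same order as the table columns.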
Solution 4:
Assuming your HTML is stored in a file named mycode.html, here is a Bash approach:
paste -d: <(grep '<th>' mycode.html | sed -e 's,</*th>,,g') <(grep '<td>' mycode.html | sed -e 's,</*td>,,g')
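Note: the output is not perfectly aligned, and because grep works line by line this only works when each <th> and <td> cell sits on its own line with no attributes on the tag.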
Solution 5:
A Perl solution using regular expressions:
undef $/;
$text = <DATA>;
@tabs = $text =~ m!<table.*?>(.*?)</table>!gms;
for (@tabs) {
    @th = m!<th>(.*?)</th>!gms;
    @td = m!<td>(.*?)</td>!gms;
}
for $i (0..$#th) {
    printf "%-16s\t: %s\n", $th[$i], $td[$i];
}
__DATA__
<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
<tr valign="top">
<th>Tests</th>
<th>Failures</th>
<th>Success Rate</th>
<th>Average Time</th>
<th>Min Time</th>
<th>Max Time</th>
</tr>
<tr valign="top" class="Failure">
<td>103</td>
<td>24</td>
<td>76.70%</td>
<td>71 ms</td>
<td>0 ms</td>
<td>829 ms</td>
</tr>
</table>
Output:
Tests                   : 103
Failures                : 24
Success Rate            : 76.70%
Average Time            : 71 ms
Min Time                : 0 ms
Max Time                : 829 ms
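Patterns such as <th>(.*?)</th> match only cells whose tags carry no attributes. For comparison, the same pairing of headings and values can be done with an actual HTML parser from the Python standard library, with no third-party packages. This is a swapped-in technique rather than a port of the Perl code, and TableParser is a hypothetical class name used for illustration:

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <th> and <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # list of rows, each a list of cell strings
        self._in_cell = False   # True while inside a <th> or <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td") and self.rows:
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._in_cell = False

    def handle_data(self, data):
        # Text outside cells (e.g. newlines between tags) is ignored.
        if self._in_cell:
            self.rows[-1][-1] += data


html_text = """
<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
<tr valign="top">
<th>Tests</th>
<th>Failures</th>
<th>Success Rate</th>
<th>Average Time</th>
<th>Min Time</th>
<th>Max Time</th>
</tr>
<tr valign="top" class="Failure">
<td>103</td>
<td>24</td>
<td>76.70%</td>
<td>71 ms</td>
<td>0 ms</td>
<td>829 ms</td>
</tr>
</table>"""

parser = TableParser()
parser.feed(html_text)
parser.close()

# The first row holds the headings; every later row holds values.
headings, *data_rows = parser.rows
for row in data_rows:
    for name, value in zip(headings, row):
        print("{0:<16}: {1}".format(name.strip(), value.strip()))
```

This sketch treats every <tr> in the input as part of one table, so a page with several tables would need extra tracking of <table> tags.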