Best way to parse HTML in JavaScript
Do the following steps:
- Create a new documentFragment
- Put your HTML string in it
- Use selectors to get what you want
Why do all the parsing work yourself (which won't work anyway, since HTML cannot be parsed reliably with RegExp) when you already have the best HTML parser available: the browser?
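The steps above can be sketched without any library. This is a minimal sketch, assuming a browser that supports the `<template>` element; `parseToFragment` and the sample markup are hypothetical, and the `doc` parameter exists only so the function can be exercised outside a browser.

```javascript
// Parse an HTML string into a DocumentFragment without attaching it to the page.
// `doc` defaults to the global `document`; passing it explicitly is only for testing.
function parseToFragment(html, doc = document) {
  var template = doc.createElement('template');
  template.innerHTML = html;   // the browser's HTML parser does the work
  return template.content;     // a DocumentFragment holding the parsed nodes
}

// Usage (browser only): query the fragment with ordinary CSS selectors.
if (typeof document !== 'undefined') {
  var fragment = parseToFragment(
    '<table><tr><td class="dataLayer">Demo</td></tr></table>'
  );
  console.log(fragment.querySelector('.dataLayer').textContent);
}
```

A side benefit is that markup inside a `<template>` is inert: scripts do not run and images do not load until the nodes are inserted into a document.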
You can use jQuery to easily traverse the DOM and create an object with the structure automatically.
// Parse the string into a detached element so it can be queried with jQuery
var $dom = $('<html>').html(the_html_string_variable_goes_here);
var featureInfo = {};

$('table:has(.dataLayer)', $dom).each(function () {
    var $tbl = $(this);
    var section = $tbl.find('.dataLayer').text();
    var obj = [];
    var $structure = $tbl.find('.dataHeaders');
    // Column names: the lowercased header texts, as a plain array
    var structure = $structure.find('th').map(function () {
        return $(this).text().toLowerCase();
    }).get();
    var $datarows = $structure.nextAll('tr');

    $datarows.each(function (i) {
        obj[i] = {};
        $(this).find('td').each(function (index, element) {
            obj[i][structure[index]] = $(element).text();
        });
    });
    featureInfo[section] = obj;
});
The code works with multiple tables of differing structure, and with multiple data rows inside each table.
The featureInfo object will hold the final structure and data. Because the header texts are lowercased before being used as keys, it can be accessed like
alert( featureInfo['Tibetan Villages'][0]['english translation'] );
or
alert( featureInfo['Tibetan Villages'][0].id );
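The core of that loop is pairing each cell with its column header. Here is a DOM-free sketch of just that step; `rowsToObjects`, `featureInfoSample`, and the cell values are illustrative and not taken from a real response. Note that every key is the lowercased header text, and every value is a string.

```javascript
// Turn a list of header texts and a list of rows (arrays of cell texts)
// into an array of objects keyed by the lowercased header text.
function rowsToObjects(headers, rows) {
  var keys = headers.map(function (h) { return h.toLowerCase(); });
  return rows.map(function (cells) {
    var record = {};
    cells.forEach(function (text, index) {
      record[keys[index]] = text;
    });
    return record;
  });
}

// Illustrative sample data only.
var featureInfoSample = {
  'Tibetan Villages': rowsToObjects(
    ['ID', 'English Translation'],
    [['3394', 'Dam Drongtso']]
  )
};
console.log(featureInfoSample['Tibetan Villages'][0]['english translation']);
```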
The "correct" way to do it is with DOMParser. Do it like this:
var parsed = new DOMParser().parseFromString(htmlString, 'text/html');
Or, if you're worried about browser compatibility, use the polyfill from the MDN documentation:
/*
* DOMParser HTML extension
* 2012-09-04
*
* By Eli Grey, http://eligrey.com
* Public domain.
* NO WARRANTY EXPRESSED OR IMPLIED. USE AT YOUR OWN RISK.
*/
/*! @source https://gist.github.com/1129031 */
/*global document, DOMParser*/
(function(DOMParser) {
    "use strict";

    var
        DOMParser_proto = DOMParser.prototype
      , real_parseFromString = DOMParser_proto.parseFromString
    ;

    // Firefox/Opera/IE throw errors on unsupported types
    try {
        // WebKit returns null on unsupported types
        if ((new DOMParser).parseFromString("", "text/html")) {
            // text/html parsing is natively supported
            return;
        }
    } catch (ex) {}

    DOMParser_proto.parseFromString = function(markup, type) {
        if (/^\s*text\/html\s*(?:;|$)/i.test(type)) {
            var
                doc = document.implementation.createHTMLDocument("")
            ;
            if (markup.toLowerCase().indexOf('<!doctype') > -1) {
                doc.documentElement.innerHTML = markup;
            }
            else {
                doc.body.innerHTML = markup;
            }
            return doc;
        } else {
            return real_parseFromString.apply(this, arguments);
        }
    };
}(DOMParser));
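Whether native or polyfilled, `parseFromString` with `'text/html'` returns a regular Document, so the standard selector APIs apply to the result. Here is a small sketch; `cellTexts` is a hypothetical helper, and the parser is passed in as an argument only so the function can be tried without a browser.

```javascript
// Return the trimmed text of every <td> in an HTML string.
// `parser` must provide parseFromString(markup, type);
// in a browser, pass `new DOMParser()`.
function cellTexts(htmlString, parser) {
  var doc = parser.parseFromString(htmlString, 'text/html');
  var cells = doc.querySelectorAll('td');
  return Array.prototype.map.call(cells, function (td) {
    return td.textContent.trim();
  });
}

// Usage (browser only), with made-up markup.
if (typeof DOMParser !== 'undefined') {
  console.log(cellTexts(
    '<table><tr><td> 3394 </td><td>Dam Drongtso</td></tr></table>',
    new DOMParser()
  ));
}
```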
Change server-side code if you can (add JSON)
If you are the one generating the HTML on the server side, you could just as well generate JSON there and pass it along inside the HTML. You would not have to parse anything on the client side, and all the data would be immediately available to your client scripts.
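You could easily put JSON in a table element as a data attribute value. It has to be valid JSON (double-quoted keys and strings), so the attribute itself is single-quoted:
<table class="featureInfo2" data-json='{"ID":3394, "Latitude":29.1, "Longitude":93.15, "PlaceName":"བསྡམས་གྲོང་ཚོ།", "Translation":"Dam Drongtso"}'>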
...
</table>
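Reading that attribute back takes a single `JSON.parse` call, provided the attribute holds strict JSON (double-quoted keys and strings). Here is a sketch, where `readJsonAttribute` is a hypothetical helper and the element can be anything exposing `getAttribute`:

```javascript
// Parse the JSON stored in an element's data-json attribute.
// Returns null when the attribute is missing or is not valid JSON.
function readJsonAttribute(element) {
  var raw = element.getAttribute('data-json');
  if (raw === null) {
    return null;
  }
  try {
    return JSON.parse(raw);
  } catch (err) {
    return null; // e.g. unquoted keys or single-quoted strings
  }
}

// Usage (browser only): read the data from the table, if it is on the page.
if (typeof document !== 'undefined') {
  var table = document.querySelector('table.featureInfo2');
  if (table) {
    console.log(readJsonAttribute(table));
  }
}
```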
Or you could add data attributes to the TDs that contain data, select only those with jQuery selectors, and generate a JavaScript object from them. No need for RegExp parsing.
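As a sketch of that last idea, suppose each data cell carries a hypothetical `data-field` attribute naming its key, for example `<td data-field="latitude">29.1</td>`. Plain DOM calls are used here in place of jQuery, and `collectFields` is an illustrative name:

```javascript
// Build one object from cells marked with data-field,
// e.g. <td data-field="latitude">29.1</td> becomes { latitude: "29.1" }.
// `cells` is any array-like of elements exposing `dataset` and `textContent`.
function collectFields(cells) {
  var record = {};
  Array.prototype.forEach.call(cells, function (cell) {
    record[cell.dataset.field] = cell.textContent.trim();
  });
  return record;
}

// Usage (browser only): gather every marked cell on the page.
if (typeof document !== 'undefined') {
  console.log(collectFields(document.querySelectorAll('td[data-field]')));
}
```

The values come back as strings, so numeric fields still need a `Number(...)` conversion if arithmetic is intended.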