Best way to parse HTML in Javascript

Do the following steps:

  1. Create a new documentFragment
  2. Put your HTML string in it
  3. Use selectors to get what you want

Why do all the parsing work - which won't work anyways, since HTML is not parsable via RegExp - when you have the best HTML parser available? (the Browser)


You can use jQuery to easily traverse the DOM and create an object with the structure automatically.

var $dom = $('<html>').html(the_html_string_variable_goes_here);
var featureInfo = {};

$('table:has(.dataLayer)', $dom).each(function(){
    var $tbl = $(this);
    var section = $tbl.find('.dataLayer').text();
    var obj = [];
    var $structure = $tbl.find('.dataHeaders');
    var structure = $structure.find('th').map(function(){return $(this).text().toLowerCase();});
    var $datarows= $structure.nextAll('tr');
    $datarows.each(function(i){
        obj[i] = {};
        $(this).find('td').each(function(index,element){
            obj[i][structure[index]] = $(element).text();
        });
    });
    featureInfo[section] = obj;
});

Working Demo

The code can work with multiple tables with different structures inside.. and also multiple data rows inside each table..

The featureInfo will hold the final structure and data, and can be accessed like

alert( featureInfo['Tibetan Villages'][0]['English Translation'] );

or

alert( featureInfo['Tibetan Villages'][0].id );

The "correct" way to do it is with DOMParser. Do it like this:

var parsed=new DOMParser.parseFromString(htmlString,'text/html');

Or, if you're worried about browser compatibility, use the polyfill on the MDN documentation:

/*
 * DOMParser HTML extension
 * 2012-09-04
 * 
 * By Eli Grey, http://eligrey.com
 * Public domain.
 * NO WARRANTY EXPRESSED OR IMPLIED. USE AT YOUR OWN RISK.
 */

/*! @source https://gist.github.com/1129031 */
/*global document, DOMParser*/

(function(DOMParser) {
    "use strict";

    var
      DOMParser_proto = DOMParser.prototype
    , real_parseFromString = DOMParser_proto.parseFromString
    ;

    // Firefox/Opera/IE throw errors on unsupported types
    try {
        // WebKit returns null on unsupported types
        if ((new DOMParser).parseFromString("", "text/html")) {
            // text/html parsing is natively supported
            return;
        }
    } catch (ex) {}

    DOMParser_proto.parseFromString = function(markup, type) {
        if (/^\s*text\/html\s*(?:;|$)/i.test(type)) {
            var
              doc = document.implementation.createHTMLDocument("")
            ;
                if (markup.toLowerCase().indexOf('<!doctype') > -1) {
                    doc.documentElement.innerHTML = markup;
                }
                else {
                    doc.body.innerHTML = markup;
                }
            return doc;
        } else {
            return real_parseFromString.apply(this, arguments);
        }
    };
}(DOMParser));

Change server-side code if you can (add JSON)

If you're the one that generates the resulting HTML on the server side you could as well generate a JSON there and pass it inside the HTML with the content. You wouldn't have to parse anything on the client side and all data would be immediately available to your client scripts.

You could easily put JSON in table element as a data attribute value:

<table class="featureInfo2" data-json="{ID:3394, Latitude:29.1, Longitude:93.15, PlaceName:'བསྡམས་གྲོང་ཚོ།', Translation:'Dam Drongtso'}">
    ...
</table>

Or you could add data attributes to TDs that contain data and parse only those using jQuery selectors and generating Javascript object out of them. No need for RegExp parsing.