Sanitize/Rewrite HTML on the Client Side

I need to display external resources loaded via cross domain requests and make sure to only display "safe" content.

Could use Prototype's String#stripScripts to remove script blocks. But handlers such as onclick or onerror are still there.

Is there any library which can at least

strip script blocks,
kill DOM handlers,
remove black listed tags (eg: embed or object).

So are any JavaScript related links and examples out there?

Solution 1:

Update 2016: There is now a Google Closure package based on the Caja sanitizer.

It has a cleaner API, was rewritten to take into account APIs available on modern browsers, and interacts better with Closure Compiler.

Shameless plug: see caja/plugin/html-sanitizer.js for a client side html sanitizer that has been thoroughly reviewed.

It is white-listed, not black-listed, but the whitelists are configurable as per CajaWhitelists

If you want to remove all tags, then do the following:

var tagBody = '(?:[^"\'>]|"[^"]*"|\'[^\']*\')*';

var tagOrComment = new RegExp(
    '<(?:'
    // Comment body.
    + '!--(?:(?:-*[^->])*--+|-?)'
    // Special "raw text" elements whose content should be elided.
    + '|script\\b' + tagBody + '>[\\s\\S]*?</script\\s*'
    + '|style\\b' + tagBody + '>[\\s\\S]*?</style\\s*'
    // Regular name
    + '|/?[a-z]'
    + tagBody
    + ')>',
    'gi');
function removeTags(html) {
  var oldHtml;
  do {
    oldHtml = html;
    html = html.replace(tagOrComment, '');
  } while (html !== oldHtml);
  return html.replace(/</g, '&lt;');
}

People will tell you that you can create an element, and assign innerHTML and then get the innerText or textContent, and then escape entities in that. Do not do that. It is vulnerable to XSS injection since <img src=bogus onerror=alert(1337)> will run the onerror handler even if the node is never attached to the DOM.

Solution 2:

The Google Caja HTML sanitizer can be made "web-ready" by embedding it in a web worker. Any global variables introduced by the sanitizer will be contained within the worker, plus processing takes place in its own thread.

For browsers that do not support Web Workers, we can use an iframe as a separate environment for the sanitizer to work in. Timothy Chien has a polyfill that does just this, using iframes to simulate Web Workers, so that part is done for us.

The Caja project has a wiki page on how to use Caja as a standalone client-side sanitizer:

Checkout the source, then build by running ant
Include html-sanitizer-minified.js or html-css-sanitizer-minified.js in your page
Call html_sanitize(...)

The worker script only needs to follow those instructions:

importScripts('html-css-sanitizer-minified.js'); // or 'html-sanitizer-minified.js'

var urlTransformer, nameIdClassTransformer;

// customize if you need to filter URLs and/or ids/names/classes
urlTransformer = nameIdClassTransformer = function(s) { return s; };

// when we receive some HTML
self.onmessage = function(event) {
    // sanitize, then send the result back
    postMessage(html_sanitize(event.data, urlTransformer, nameIdClassTransformer));
};

(A bit more code is needed to get the simworker library working, but it's not important to this discussion.)

Demo: https://dl.dropbox.com/u/291406/html-sanitize/demo.html

Solution 3:

Never trust the client. If you're writing a server application, assume that the client will always submit unsanitary, malicious data. It's a rule of thumb that will keep you out of trouble. If you can, I would advise doing all validation and sanitation in server code, which you know (to a reasonable degree) won't be fiddled with. Perhaps you could use a serverside web application as a proxy for your clientside code, which fetches from the 3rd party and does sanitation before sending it to the client itself?

[edit] I'm sorry, I misunderstood the question. However, I stand by my advice. Your users will probably be safer if you sanitize on the server before sending it to them.

Solution 4:

Now that all major browsers support sandboxed iframes, there is a much simpler way that I think can be secure. I'd love it if this answer could be reviewed by people who are more familiar with this kind of security issue.

NOTE: This method definitely will not work in IE 9 and earlier. See this table for browser versions that support sandboxing. (Note: the table seems to say it won't work in Opera Mini, but I just tried it, and it worked.)

The idea is to create a hidden iframe with JavaScript disabled, paste your untrusted HTML into it, and let it parse it. Then you can walk the DOM tree and copy out the tags and attributes that are considered safe.

The whitelists shown here are just examples. What's best to whitelist would depend on the application. If you need a more sophisticated policy than just whitelists of tags and attributes, that can be accommodated by this method, though not by this example code.

var tagWhitelist_ = {
  'A': true,
  'B': true,
  'BODY': true,
  'BR': true,
  'DIV': true,
  'EM': true,
  'HR': true,
  'I': true,
  'IMG': true,
  'P': true,
  'SPAN': true,
  'STRONG': true
};

var attributeWhitelist_ = {
  'href': true,
  'src': true
};

function sanitizeHtml(input) {
  var iframe = document.createElement('iframe');
  if (iframe['sandbox'] === undefined) {
    alert('Your browser does not support sandboxed iframes. Please upgrade to a modern browser.');
    return '';
  }
  iframe['sandbox'] = 'allow-same-origin';
  iframe.style.display = 'none';
  document.body.appendChild(iframe); // necessary so the iframe contains a document
  iframe.contentDocument.body.innerHTML = input;
  
  function makeSanitizedCopy(node) {
    if (node.nodeType == Node.TEXT_NODE) {
      var newNode = node.cloneNode(true);
    } else if (node.nodeType == Node.ELEMENT_NODE && tagWhitelist_[node.tagName]) {
      newNode = iframe.contentDocument.createElement(node.tagName);
      for (var i = 0; i < node.attributes.length; i++) {
        var attr = node.attributes[i];
        if (attributeWhitelist_[attr.name]) {
          newNode.setAttribute(attr.name, attr.value);
        }
      }
      for (i = 0; i < node.childNodes.length; i++) {
        var subCopy = makeSanitizedCopy(node.childNodes[i]);
        newNode.appendChild(subCopy, false);
      }
    } else {
      newNode = document.createDocumentFragment();
    }
    return newNode;
  };

  var resultElement = makeSanitizedCopy(iframe.contentDocument.body);
  document.body.removeChild(iframe);
  return resultElement.innerHTML;
};

SECURITY HOLE: Commenter @Explosion points out that an href attribute can contain JavaScript, like <a href="javascript:alert('Oops')">. It should be possible to catch that and remove it in the sanitization code, but the above code has not (yet) been updated to do that.

You can try it out here.

Note that I'm disallowing style attributes and tags in this example. If you allowed them, you'd probably want to parse the CSS and make sure it's safe for your purposes.

I've tested this on several modern browsers (Chrome 40, Firefox 36 Beta, IE 11, Chrome for Android), and on one old one (IE 8) to make sure it bailed before executing any scripts. I'd be interested to know if there are any browsers that have trouble with it, or any edge cases that I'm overlooking.

Solution 5:

You can't anticipate every possible weird type of malformed markup that some browser somewhere might trip over to escape blacklisting, so don't blacklist. There are many more structures you might need to remove than just script/embed/object and handlers.

Instead attempt to parse the HTML into elements and attributes in a hierarchy, then run all element and attribute names against an as-minimal-as-possible whitelist. Also check any URL attributes you let through against a whitelist (remember there are more dangerous protocols than just javascript:).

If the input is well-formed XHTML the first part of the above is much easier.

As always with HTML sanitisation, if you can find any other way to avoid doing it, do that instead. There are many, many potential holes. If the major webmail services are still finding exploits after this many years, what makes you think you can do better?