Regular expression to remove HTML tags

Using a regular expression to parse HTML is fraught with pitfalls. HTML is not a regular language and hence can't be 100% correctly parsed with a regex. This is just one of many problems you will run into. The best approach is to use an HTML / XML parser to do this for you.
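
To make one of those pitfalls concrete, here is a minimal sketch in JavaScript. The function name `naiveStrip` and the sample input are made up for illustration. It shows a simple tag-matching regex ending its match too early when a quoted attribute value contains a `>`:

```javascript
// Naive tag stripper, shown only to illustrate the failure mode.
// naiveStrip is a hypothetical name, not part of any library.
function naiveStrip(html) {
  return html.replace(/<[^>]*>/g, '');
}

// A '>' inside a quoted attribute value ends the first match too early,
// so part of the attribute leaks into the result.
const input = '<a title="a > b">link</a>';
const output = naiveStrip(input);

console.log(JSON.stringify(output)); // " b\">link"
```

A real parser tracks whether it is inside a quoted attribute value, which a pattern like this cannot do.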

Here is a blog post I wrote a while back that goes into more detail about this problem:

http://blogs.msdn.com/b/jaredpar/archive/2008/10/15/regular-expression-limitations.aspx

That being said, here's a solution that should fix this particular problem, though it is by no means perfect.

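// Requires: using System.Text.RegularExpressions;
// Captures the text after an opening <img> or <a> tag, up to the next '<'.
var pattern = @"<(img|a)[^>]*>(?<content>[^<]*)<";
var regex = new Regex(pattern);
var m = regex.Match(sSummary);
if (m.Success) {
  sResult = m.Groups["content"].Value;
}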
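
For comparison, the same pattern can be tried in JavaScript, which also supports `(?<name>...)` named groups. This is only a sketch: the value assigned to `sSummary` is a made-up input, since the original string is not shown here.

```javascript
// JavaScript port of the pattern above, for illustration only.
const pattern = /<(img|a)[^>]*>(?<content>[^<]*)</;

// Hypothetical input standing in for sSummary.
const sSummary = '<a href="#">Read more</a>';

// exec() returns null when nothing matches, so guard before reading groups.
const m = pattern.exec(sSummary);
const sResult = m ? m.groups.content : '';

console.log(sResult); // Read more
```

Like the original, this captures only the text between the first matching opening tag and the next `<`, so nested markup inside the link is not handled.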

To turn this:

'<td>mamma</td><td><strong>papa</strong></td>'

into this:

'mamma papa'

You need to replace the tags with spaces:

.replace(/<[^>]*>/g, ' ')

and reduce any duplicate spaces into single spaces:

.replace(/\s{2,}/g, ' ')

then trim away leading and trailing spaces with:

.trim();

Put together, your tag-removal function looks like this:

function removeTags(string){
  return string.replace(/<[^>]*>/g, ' ')
               .replace(/\s{2,}/g, ' ')
               .trim();
}
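
As a quick check, here is the function applied to the example string from above. The function is repeated so the snippet runs on its own. The second call uses a made-up input and shows one limit of this approach: character entities are not tags, so they pass through undecoded.

```javascript
// Same function as above, repeated so this snippet is self-contained.
function removeTags(string) {
  return string.replace(/<[^>]*>/g, ' ')   // every tag becomes a space
               .replace(/\s{2,}/g, ' ')    // collapse runs of whitespace
               .trim();                    // drop leading/trailing space
}

const cells = '<td>mamma</td><td><strong>papa</strong></td>';
console.log(removeTags(cells)); // mamma papa

// Hypothetical input: the entity is left as-is because it is not a tag.
console.log(removeTags('<p>fish &amp; chips</p>')); // fish &amp; chips
```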
