Regular expression to remove HTML tags
Using a regular expression to parse HTML is fraught with pitfalls. HTML is not a regular language and hence can't be 100% correctly parsed with a regex. This is just one of many problems you will run into. The best approach is to use an HTML / XML parser to do this for you.
Here is a link to a blog post I wrote a while back that goes into more detail about this problem.
- http://blogs.msdn.com/b/jaredpar/archive/2008/10/15/regular-expression-limitations.aspx
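To make the pitfall concrete, here is a small JavaScript sketch showing how a naive tag-matching regex goes wrong when an attribute value contains `>`. The helper name `stripTags` and the sample markup are assumptions made up for illustration.

```javascript
// Hypothetical helper: strips anything that looks like <...>.
const stripTags = (html) => html.replace(/<[^>]*>/g, '');

// Valid HTML: a quoted attribute value may contain '>'.
const tricky = '<a title="1 > 0" href="#">link</a>';

// The regex stops at the first '>' it sees, which is inside the attribute.
console.log(JSON.stringify(stripTags(tricky))); // " 0\" href=\"#\">link"
```

The regex treats the `>` inside the `title` attribute as the end of the tag, so part of the attribute leaks into the output instead of just `link`. An HTML parser handles quoted attribute values correctly.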
That being said, here's a solution that should fix this particular problem. It is by no means a perfect solution, though.
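// Requires: using System.Text.RegularExpressions;
// Capture the text that follows the first <img ...> or <a ...> tag, up to the next '<'.
var pattern = @"<(img|a)[^>]*>(?<content>[^<]*)<";
var regex = new Regex(pattern);
var m = regex.Match(sSummary);
if (m.Success) {
    sResult = m.Groups["content"].Value;
}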
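For readers working in JavaScript, roughly the same pattern can be sketched with a named capture group. The function name `firstTagContent` and the sample strings are assumptions for illustration, not part of the original question.

```javascript
// Same idea as the C# pattern: text after the first <img ...> or <a ...> tag,
// up to the next '<'. The tag-name group is non-capturing here.
const pattern = /<(?:img|a)[^>]*>(?<content>[^<]*)</;

// Hypothetical helper: returns the captured text, or null when nothing matches.
function firstTagContent(summary) {
  const m = pattern.exec(summary);
  return m ? m.groups.content : null;
}

console.log(firstTagContent('<a href="x.html">Read more</a>')); // "Read more"
console.log(firstTagContent('<abbr title="x">HTML</abbr>'));    // "HTML"
console.log(firstTagContent('plain text'));                     // null
```

The second call shows one of the pattern's limits: because `[^>]*` follows the tag name directly, any tag whose name starts with `a` or `img` (such as `<abbr>`) also matches.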
To turn this:
'<td>mamma</td><td><strong>papa</strong></td>'
into this:
'mamma papa'
You need to replace the tags with spaces:
.replace(/<[^>]*>/g, ' ')
and reduce any duplicate spaces into single spaces:
.replace(/\s{2,}/g, ' ')
then trim away leading and trailing spaces with:
.trim();
So your tag-removal function looks like this:
function removeTags(string) {
    return string.replace(/<[^>]*>/g, ' ')
        .replace(/\s{2,}/g, ' ')
        .trim();
}
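As a quick check, the sketch below runs the function on the example input and prints each intermediate step. The variable names `step1` to `step3` and the second sample string are assumptions added for illustration.

```javascript
// Same function as above, repeated so the snippet runs on its own.
function removeTags(string) {
  return string.replace(/<[^>]*>/g, ' ')
    .replace(/\s{2,}/g, ' ')
    .trim();
}

const input = '<td>mamma</td><td><strong>papa</strong></td>';

// Intermediate results of each step.
const step1 = input.replace(/<[^>]*>/g, ' '); // ' mamma   papa  '
const step2 = step1.replace(/\s{2,}/g, ' ');  // ' mamma papa '
const step3 = step2.trim();                   // 'mamma papa'

console.log(removeTags(input));                     // 'mamma papa'
console.log(removeTags('<p>Fish &amp; chips</p>')); // 'Fish &amp; chips'
```

Note that entities such as `&amp;` pass through unchanged; this approach only removes tags and does not decode HTML.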