Strip all HTML tags except links

Solution 1:

<(?!\/?a(?=>|\s.*>))\/?.*?>

Try this. Had something similar for p tags. Worked for them so don't see why not. Uses negative lookahead to check that it doesn't match a (prefixed with an optional / character) where (using positive lookahead) a (with optional / prefix) is followed by a > or a space, stuff and then >. This then matches up until the next > character. Put this in a subst with

s/<(?!\/?a(?=>|\s.*>))\/?.*?>//g;

This should leave only the opening and closing a tags

Solution 2:

I keep going on about it, but there's no way I can recommend regexr too often. It's fantastic for testing this type of things.

Solution 3:

In general there are problems with this approach. Regexes are best for 'flat' text matches - nested data pushes regex engines into areas for which they are not designed. General HTML parsing needs a parser not a regex engine (Google for the difference between regular and context-free languages if you want the full technical details).

It is easy to strip out all tags by replacing /</ and />/ with the empty string or their entity equivalents but selectively filtering HTML using regexes will be vulnerable to a wide range of accidental or malicious inputs breaking things.