PHP regex to match outside of HTML tags

I am running preg_replace on an HTML page. My pattern is meant to wrap certain words in a surrounding tag. However, sometimes my regular expression also modifies the HTML tags themselves. For example, when I try to replace this text:

<a href="example.com" alt="yasar home page">yasar</a>

so that yasar reads <span class="selected-word">yasar</span>, my regular expression also replaces yasar in the alt attribute of the anchor tag. The preg_replace() call I am currently using looks like this:

preg_replace("/(asf|gfd|oyws)/", '<span class=something>${1}</span>',$target);

How can I write a regular expression so that it doesn't match anything inside an HTML tag?


You can use an assertion for that: you just have to ensure that the searched words occur somewhere after a >, or before any <. The latter test is easier to accomplish, because lookahead assertions can be variable-length, whereas lookbehind assertions in PCRE must be fixed-length:

/(asf|foo|barr)(?=[^>]*(<|$))/

See also http://www.regular-expressions.info/lookaround.html for a nice explanation of that assertion syntax.
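As a minimal sketch of how the lookahead pattern above behaves on the anchor example from the question: the word list and the selected-word class are taken from the question, while the variable names are only illustrative. The lookahead succeeds only when a < (or the end of the string) can be reached without crossing a >, so the yasar inside the alt attribute is left alone.

```php
<?php
// Words to wrap; "yasar" comes from the question's example.
$words = ['yasar'];

// Build the alternation, escaping regex metacharacters and the "/" delimiter.
$alternation = implode('|', array_map(function ($w) {
    return preg_quote($w, '/');
}, $words));

// Match a word only if a "<" or the end of the string can be reached
// from it without passing a ">" first, i.e. the word is not inside a tag.
$pattern = '/(' . $alternation . ')(?=[^>]*(<|$))/';

$target = '<a href="example.com" alt="yasar home page">yasar</a>';
$result = preg_replace($pattern, '<span class="selected-word">${1}</span>', $target);

echo $result, PHP_EOL;
// Prints: <a href="example.com" alt="yasar home page"><span class="selected-word">yasar</span></a>
```

One caveat with this approach: a literal > in ordinary text (for example "yasar > foo" with no tag following) makes the lookahead fail, so such words would be skipped even though they are outside any tag.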


Yasar, resurrecting this question because it had another solution that wasn't mentioned.

Instead of just checking that the next tag character is an opening tag, this solution skips all <full tags>.

With all the usual disclaimers about using regex to parse HTML, here is the regex:

<[^>]*>(*SKIP)(*F)|word1|word2|word3

Here is a demo. In code, it looks like this:

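// Sample input: "word2" appears once inside a tag and once in text content.
$target = "word1 <a skip this word2 >word2 again</a> word3";
// Match a whole tag and discard it with (*SKIP)(*F); otherwise match the words.
$regex = "~<[^>]*>(*SKIP)(*F)|word1|word2|word3~";
// \0 in the replacement stands for the whole match.
$repl = '<span class="">\0</span>';
$new = preg_replace($regex, $repl, $target);
// Escape the result so the inserted tags are visible when viewed in a browser.
echo htmlentities($new);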

Here is an online demo of this code.
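The same (*SKIP)(*F) idea can be wrapped in a small reusable function. This is only a sketch: the function name wrap_words_outside_tags and its parameters are hypothetical, and it assumes the word list is non-empty and that tags in the input are closed with a >.

```php
<?php
/**
 * Hypothetical helper: wraps each listed word in a <span>,
 * skipping anything that sits inside <...>.
 * Assumes $words is non-empty.
 */
function wrap_words_outside_tags(string $html, array $words, string $class): string
{
    // Escape each word so regex metacharacters and the "~" delimiter are literal.
    $alternation = implode('|', array_map(function ($w) {
        return preg_quote($w, '~');
    }, $words));

    // The first alternative consumes a whole tag, then (*SKIP)(*F) discards it
    // and resumes the search after the tag; the rest matches the words.
    $regex = '~<[^>]*>(*SKIP)(*F)|' . $alternation . '~';

    $open = '<span class="' . htmlspecialchars($class, ENT_QUOTES) . '">';

    // A callback avoids "$" or "\" in the class being read as backreferences.
    return preg_replace_callback($regex, function ($m) use ($open) {
        return $open . $m[0] . '</span>';
    }, $html);
}

$target = '<a href="example.com" alt="yasar home page">yasar</a>';
echo wrap_words_outside_tags($target, ['yasar'], 'selected-word'), PHP_EOL;
// Prints: <a href="example.com" alt="yasar home page"><span class="selected-word">yasar</span></a>
```

Running the words through preg_quote() matters once the list comes from user input, since a word such as "c++" would otherwise be interpreted as regex syntax.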

Reference

  1. How to match pattern except in situations s1, s2, s3
  2. How to match a pattern unless...