Regex / DOMDocument - match and replace text not in a link

Solution 1:

Here is an UTF-8 safe solution, which not only works with properly formatted documents, but also with document fragments.

The mb_convert_encoding is needed, because loadHtml() seems to has a bug with UTF-8 encoding (see here and here).

The mb_substr is trimming the body tag from the output, this way you get back your original content without any additional markup.

<?php
$html = '<p>Match this text and replace it</p>
<p>Don\'t <a href="/">match this text</a></p>
<p>We still need to match this text and replace itŐŰ</p>
<p>This is <a href="#">a link <span>with <strong>don\'t match this text</strong> content</span></a></p>';

$dom = new DOMDocument();
// loadXml needs properly formatted documents, so it's better to use loadHtml, but it needs a hack to properly handle UTF-8 encoding
$dom->loadHtml(mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8"));

$xpath = new DOMXPath($dom);

foreach($xpath->query('//text()[not(ancestor::a)]') as $node)
{
    $replaced = str_ireplace('match this text', 'MATCH', $node->wholeText);
    $newNode  = $dom->createDocumentFragment();
    $newNode->appendXML($replaced);
    $node->parentNode->replaceChild($newNode, $node);
}

// get only the body tag with its contents, then trim the body tag itself to get only the original content
echo mb_substr($dom->saveXML($xpath->query('//body')->item(0)), 6, -7, "UTF-8");

References:
1. find and replace keywords by hyperlinks in an html fragment, via php dom
2. Regex / DOMDocument - match and replace text not in a link
3. php problem with russian language
4. Why Does DOM Change Encoding?

I read dozens of answers in the subject, so I am sorry if I forgot somebody (please comment it and I will add yours as well in this case).

Thanks for Gordon and stillstanding for commenting on my other answer.

Solution 2:

Try this one:

$dom = new DOMDocument;
$dom->loadHTML($html_content);

function preg_replace_dom($regex, $replacement, DOMNode $dom, array $excludeParents = array()) {
  if (!empty($dom->childNodes)) {
    foreach ($dom->childNodes as $node) {
      if ($node instanceof DOMText && 
          !in_array($node->parentNode->nodeName, $excludeParents)) 
      {
        $node->nodeValue = preg_replace($regex, $replacement, $node->nodeValue);
      } 
      else
      {
        preg_replace_dom($regex, $replacement, $node, $excludeParents);
      }
    }
  }
}

preg_replace_dom('/match this text/i', 'IT WORKS', $dom->documentElement, array('a'));

Regex / DOMDocument - match and replace text not in a link

Solution 1:

Solution 2:

Related

Recent Posts