Highlight keywords in a paragraph

I need to highlight a keyword in a paragraph, as Google does in its search results. Let's assume that I have a MySQL db with blog posts. When a user searches for a certain keyword, I wish to return the posts which contain that keyword, but show only part of each post (the paragraph which contains the searched keyword) and highlight the keyword in it.

My plan is this:

  • find the post id which has the searched keyword in its content;
  • read the content of that post again and put each word in a fixed buffer array (50 words) until I find the keyword.

Can you help me with some logic, or at least tell me if my logic is ok? I'm in a PHP learning stage.


If the content contains HTML (note that this is a pretty robust solution):

$string = '<p>foo<b>bar</b></p>';
$keyword = 'foo';
$dom = new DomDocument();
$dom->loadHtml($string);
$xpath = new DomXpath($dom);
$elements = $xpath->query('//*[contains(.,"'.$keyword.'")]');
foreach ($elements as $element) {
    // Work on a snapshot: childNodes is live and changes as nodes are replaced
    foreach (iterator_to_array($element->childNodes) as $child) {
        if (!$child instanceof DomText) continue;
        $text = $child->textContent;
        if (stripos($text, $keyword) === false) continue; // nothing to highlight here
        $fragment = $dom->createDocumentFragment();
        while (($pos = stripos($text, $keyword)) !== false) {
            // Text before the match stays as plain text
            $fragment->appendChild(new DomText(substr($text, 0, $pos)));
            // Wrap the match, keeping its original case
            $word = substr($text, $pos, strlen($keyword));
            $highlight = $dom->createElement('span');
            $highlight->appendChild(new DomText($word));
            $highlight->setAttribute('class', 'highlight');
            $fragment->appendChild($highlight);
            $text = substr($text, $pos + strlen($keyword));
        }
        // Compare against '' because empty() would also drop a trailing "0"
        if ($text !== '') $fragment->appendChild(new DomText($text));
        $element->replaceChild($fragment, $child);
    }
}
$string = $dom->saveXml($dom->getElementsByTagName('body')->item(0)->firstChild);

Results in:

<p><span class="highlight">foo</span><b>bar</b></p>

And with:

$string = '<body><p>foobarbaz<b>bar</b></p></body>';
$keyword = 'bar';

You get (broken onto multiple lines for readability):

<p>foo
    <span class="highlight">bar</span>
    baz
    <b>
        <span class="highlight">bar</span>
    </b>
</p>
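
One caveat with the snippet above: the keyword is concatenated straight into the XPath expression, so a keyword containing a double quote would break the query. A minimal sketch of one way around that, assuming a helper (here called xpathLiteral, a made-up name) that builds an XPath 1.0 string literal and falls back to concat() when both quote types are present:

```php
<?php
// Hypothetical helper: turn an arbitrary string into an XPath 1.0 string literal.
function xpathLiteral($value)
{
    // No double quote inside: wrap in double quotes.
    if (strpos($value, '"') === false) {
        return '"' . $value . '"';
    }
    // No single quote inside: wrap in single quotes.
    if (strpos($value, "'") === false) {
        return "'" . $value . "'";
    }
    // Both quote types present: split on " and glue the pieces with concat().
    $parts = explode('"', $value);
    return 'concat("' . implode('", \'"\', "', $parts) . '")';
}

// Build the same kind of expression as above, but with a safely quoted keyword.
echo '//*[contains(., ' . xpathLiteral('say "hi"') . ')]', "\n";
// prints: //*[contains(., 'say "hi"')]
```

The result can be dropped into the query in place of the hand-written quotes, e.g. $xpath->query('//*[contains(., ' . xpathLiteral($keyword) . ')]').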

Beware of non-DOM solutions (like regex or str_replace): highlighting a keyword such as "div" tends to completely destroy your HTML. The DOM approach above only ever highlights text in the document body, never anything inside a tag.
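
To see the problem concretely, here is a minimal sketch of what a naive str_replace does when the keyword happens to be a tag name. The sample markup and variable names are made up for illustration:

```php
<?php
// Hypothetical sample input: the keyword "div" is also a tag name.
$html = '<div class="post">A div is a block element.</div>';
$keyword = 'div';

// Naive highlighting: replaces every occurrence, including those inside tags.
$naive = str_replace($keyword, '<span class="highlight">' . $keyword . '</span>', $html);

echo $naive, "\n";
// The opening tag now starts with "<<span class=..." and is no longer valid markup.
```

All three occurrences of "div" get wrapped, including the two that are part of the opening and closing tags, which is exactly the damage a DOM-based approach avoids.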


Edit: Since you want Google-style results, here's one way of doing it:

function getKeywordStubs($string, array $keywords, $maxStubSize = 10) {
    $dom = new DomDocument();
    $dom->loadHtml($string);
    $xpath = new DomXpath($dom);
    $results = array();
    $maxStubHalf = ceil($maxStubSize / 2);
    foreach ($keywords as $keyword) {
        $elements = $xpath->query('//*[contains(.,"'.$keyword.'")]');
        $replace = '<span class="highlight">'.$keyword.'</span>';
        foreach ($elements as $element) {
            $stub = $element->textContent;
            $regex = '#^.*?((\w*\W*){'.
                 $maxStubHalf.'})('.
                 preg_quote($keyword, '#').
                 ')((\w*\W*){'.
                 $maxStubHalf.'}).*?$#ims';
            // Keep only the words around the keyword (groups 1, 3 and 4)
            $stub = preg_replace($regex, '\\1\\3\\4', $stub);
            $stub = str_ireplace($keyword, $replace, $stub);
            $results[] = $stub;
        }
    }
    $results = array_unique($results);
    return $results;
}

OK, so that returns an array of matches, each with up to $maxStubSize words of context around the keyword (up to half that number before it, and half after).

So, given a string:

<p>a whole 
    <b>bunch of</b> text 
    <a>here for</a> 
    us to foo bar baz replace out from this string
    <b>bar</b>
</p>

Calling getKeywordStubs($string, array('bar', 'bunch')) will result in:

array(4) {
  [0]=>
  string(75) "here for us to foo <span class="highlight">bar</span> baz replace out from "
  [3]=>
  string(34) "<span class="highlight">bar</span>"
  [4]=>
  string(62) "a whole <span class="highlight">bunch</span> of text here for "
  [7]=>
  string(39) "<span class="highlight">bunch</span> of"
}

You could then build your result blurb by sorting the list by strlen and picking the two longest matches (assuming PHP 5.3+ for the closure):

usort($results, function($str1, $str2) { 
    return strlen($str2) - strlen($str1);
});
$description = implode('...', array_slice($results, 0, 2));

Which results in:

here for us to foo <span class="highlight">bar</span> baz replace out from ...a whole <span class="highlight">bunch</span> of text here for 

I hope that helps... (I do feel this is a bit... bloated... I'm sure there are better ways to do this, but here's one way)...


Maybe you could do something like this when you're connected to the database:

$keyword = $_REQUEST["keyword"]; //fetch the keyword from the request
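// $mysqli is assumed to be an open mysqli connection (the mysql_* functions were removed in PHP 7)
$stmt = $mysqli->prepare("SELECT * FROM `posts` WHERE `content` LIKE CONCAT('%', ?, '%')");
$stmt->bind_param('s', $keyword); //bind the keyword instead of escaping it by hand
$stmt->execute();
$result = $stmt->get_result(); //ask the database for the posttexts (get_result() needs the mysqlnd driver)
while ($row = $result->fetch_assoc()) {//do the following for each result: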
  $text = $row["content"];//we're only interested in the content at the moment
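  $pos = strripos($text, $keyword); //last match, case-insensitive like MySQL's LIKE usually is
  $text = substr($text, max(0, (int)$pos - 150), 300); //cut out; max() avoids a negative start near the beginning
  $text = htmlentities($text); //escape before adding markup, or the <strong> tags get escaped too
  $text = preg_replace('/'.preg_quote(htmlentities($keyword), '/').'/i', '<strong>$0</strong>', $text); //highlight
  echo $text; //print it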
  echo "<hr>";//draw a line under it
}

If you wish to cut out the relevant paragraphs, then after applying str_replace as above you can use stripos() to find the position of each <strong> section, and pass an offset from that position to substr() to cut out a section of the paragraph, such as:

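// $searchterms (array of terms) and $paragraph (the text) are assumed to be set already

foreach($searchterms as $search)
{
    $paragraph = str_replace($search, "<strong>$search</strong>", $paragraph);
}

$pos = 0;
$section = array();

for($i = 0; $i < 4; $i++)
{
    $pos = stripos($paragraph, "<strong>", $pos);
    if ($pos === false) break; // no more highlighted terms
    $section[$i] = substr($paragraph, max(0, $pos - 100), 200);
    $pos += strlen("<strong>"); // move past this match so the next search finds the next one
}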

which will give you an array of up to four short snippets (up to 200 characters each) to use however you wish. It may also be beneficial to search for the nearest space to each cutting location, and cut from there to avoid half-words. Oh, and you also need to check for errors, but I'll leave that bit up to you.
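
As a rough sketch of the "nearest space" idea, the cut can be widened outwards to the closest spaces on both sides so that no word is split. The function name cutAtWordBoundaries and its parameters are made up for this example, and it works on bytes, so it assumes single-byte text:

```php
<?php
// Hypothetical helper: cut roughly $radius characters either side of $pos,
// then widen the cut outwards to the nearest spaces so no word is split.
function cutAtWordBoundaries($text, $pos, $radius = 100)
{
    $start = max(0, $pos - $radius);
    $end = min(strlen($text), $pos + $radius);

    // Move the start back to just after the previous space (or to 0).
    if ($start > 0) {
        $space = strrpos(substr($text, 0, $start), ' ');
        $start = ($space === false) ? 0 : $space + 1;
    }
    // Move the end forward to the next space (or to the end of the text).
    $space = strpos($text, ' ', $end);
    $end = ($space === false) ? strlen($text) : $space;

    return substr($text, $start, $end - $start);
}

$text = 'The quick brown fox jumps over the lazy dog near the river bank';
echo cutAtWordBoundaries($text, strpos($text, 'jumps'), 8), "\n";
// prints: brown fox jumps over
```

Widening rather than shrinking means a snippet can come out slightly longer than 2 * $radius, which seemed the safer trade-off for keeping the keyword's neighbours intact.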


You could try exploding your database search result into an array of words using explode() and then use array_search() on it. Set the $distance variable in the example below to control how many words appear on either side of the first match of $keyword (the comparison is strict, so you get $distance - 1 words per side).

In the example, I've included lorem ipsum text as a sample database result paragraph and set the $keyword to 'scelerisque'. You'd obviously replace these in your code.

//example paragraph text
$lorum = 'Nunc nec magna at nibh imperdiet dignissim quis eu velit. 
vel mattis odio rutrum nec. Etiam sit amet tortor nibh, molestie 
vestibulum tortor. Integer condimentum magna dictum purus vehicula 
et scelerisque mauris viverra. Nullam in lorem erat. Ut dolor libero, 
tristique et pellentesque sed, mattis eget dui. Cum sociis natoque 
penatibus et magnis dis parturient montes, nascetur ridiculus mus. 
.';

//turn paragraph into array
$ipsum = explode(' ',$lorum);
//set keyword
$keyword = 'scelerisque';
//set excerpt distance
$distance = 10;

//look for keyword in paragraph array, return array key of first match
$match_key = array_search($keyword,$ipsum);
//default to an empty excerpt in case the keyword is not found
$excerpt = '';

//array_search() returns false on failure; empty() would also reject a match at key 0
if($match_key !== false){
    $words = array();

    foreach($ipsum as $key=>$value){
        //if paragraph array key inside excerpt distance
        if($key > $match_key-$distance and $key< $match_key+$distance){ 
            //if array key matches keyword key, bold the word
            if($key == $match_key){
                $word = '<b>'.$value.'</b>';
                }
            else{
                $word = $value;
                }
            //create excerpt array to hold words within distance
            $words[] = $word;
            }

        }
    //turn excerpt array into a string
    $excerpt = implode(' ',$words);
    }
//print the string
echo $excerpt;

This prints (line breaks omitted): "vestibulum tortor. Integer condimentum magna dictum purus vehicula et <b>scelerisque</b> mauris viverra. Nullam in lorem erat. Ut dolor libero,"
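
Note that array_search() only finds exact tokens, so "Scelerisque" or "scelerisque," (with trailing punctuation) would be missed. A minimal sketch of a more forgiving variant, assuming plain (non-HTML) text; the function name wordExcerpt is made up, and here $distance means the full number of words kept on each side:

```php
<?php
// Hypothetical helper: return up to $distance words either side of the first
// word containing $keyword (case-insensitive), with that word in bold.
function wordExcerpt($text, $keyword, $distance = 10)
{
    // Split on any whitespace so line breaks do not stick to words.
    $words = preg_split('/\s+/', trim($text));

    foreach ($words as $i => $word) {
        if (stripos($word, $keyword) === false) {
            continue;
        }
        $start = max(0, $i - $distance);
        $slice = array_slice($words, $start, $i - $start + $distance + 1);
        // The matched word sits at index $i - $start inside the slice.
        $slice[$i - $start] = '<b>' . $word . '</b>';
        return implode(' ', $slice);
    }
    return ''; // keyword not found
}

echo wordExcerpt("Integer condimentum magna\net Scelerisque, mauris viverra", 'scelerisque', 2), "\n";
// prints: magna et <b>Scelerisque,</b> mauris viverra
```

Because it matches substrings, a keyword like "bar" would also hit "barn"; whether that is acceptable depends on how strict the search should be.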