Parse HTML with PHP's HTML DOMDocument

Solution 1:

If you want to get :

  • The text
  • that's inside a <div> tag with class="text"
  • that's, itself, inside a <div> with class="main"

I would say the easiest way is not to use DOMDocument::getElementsByTagName -- which will return all tags that have a specific name (while you only want some of them).

Instead, I would use an XPath query on your document, using the DOMXpath class.


For example, something like this should do, to load the HTML string into a DOM object, and instance the DOMXpath class :

$html = <<<HTML
<div class="main">
    <div class="text">
    Capture this text 1
    </div>
</div>

<div class="main">
    <div class="text">
    Capture this text 2
    </div>
</div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);


And, then, you can use XPath queries, with the DOMXPath::query method, that returns the list of elements you were searching for :

$tags = $xpath->query('//div[@class="main"]/div[@class="text"]');
foreach ($tags as $tag) {
    var_dump(trim($tag->nodeValue));
}


And executing this gives me the following output :

string 'Capture this text 1' (length=19)
string 'Capture this text 2' (length=19)

Solution 2:

You can use http://simplehtmldom.sourceforge.net/

It is very simple easy to use DOM parser written in php, by which you can easily fetch the content of div tag.

Something like this:

// Find all <div> which have attribute id=text
$ret = $html->find('div[id=text]'); 

See the documentation of it for more help.