How to parse CDATA HTML-content of XML using SimpleXML?
I answered this once before, but I can't find that answer any longer.
If you take a look at the string (simplified/beautified):
<content:encoded><![CDATA[
<p>Lorem Ipsom</p>
<p>
<a href='laura-bertram-trance-gemini-145-1080.jpg'
title='<br>November 2012 calendar from 5.10 The Test<br> <a href="</a>
</p>]]>
</content:encoded>
You can see that you have HTML encoded inside the node-value of the <content:encoded>
element. So first you need to obtain the HTML value, which you already do:
$html = $boo->children('content', true)->encoded;
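To make that step reproducible on its own, here is a minimal sketch. The feed string is a made-up stand-in for yours (the `<rss>`/`<item>` layout and the file name are assumptions), and `$boo` is assumed to be one feed item:

```php
<?php
// Hypothetical minimal feed; your real feed will have more elements.
$xml = <<<XML
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <item>
    <title>Example</title>
    <content:encoded><![CDATA[<p>Lorem Ipsum</p><p><a href="picture.jpg">link</a></p>]]></content:encoded>
  </item>
</rss>
XML;

$feed = simplexml_load_string($xml);
$boo  = $feed->item; // assumption: $boo is a single item of the feed

// children('content', true) selects child elements by namespace prefix;
// casting to string yields the text of the CDATA section.
$html = (string) $boo->children('content', true)->encoded;

echo $html, "\n";
```

The explicit `(string)` cast is optional in many places, but it makes clear that from here on you are working with a plain string and not with a SimpleXMLElement.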
Then you need to parse the HTML inside $html. The libraries available for parsing HTML in PHP are outlined in:
- How to parse and process HTML/XML with PHP?
If you decide to use the more or less recommended DOMDocument for the job, you only need to get the attribute value of a certain element:
- PHP DOMDocument getting Attribute of Tag
Or, for its sister library SimpleXML, which you already use (so this is the more recommended route; see also the next section):
- How to get an attribute with SimpleXML?
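As a rough illustration of the DOMDocument-only route, the following sketch reads the `href` of the first `<a>` element. The HTML string is a hypothetical stand-in for the decoded CDATA content, not your actual feed data:

```php
<?php
// Hypothetical HTML chunk standing in for the decoded CDATA content.
$htmlString = '<p>Lorem Ipsum</p><p><a href="picture.jpg" title="A picture">link</a></p>';

$doc = new DOMDocument();
// keep parser warnings internal instead of printing them
libxml_use_internal_errors(true);
$doc->loadHTML($htmlString);

// first <a> element in document order, or null if there is none
$anchor = $doc->getElementsByTagName('a')->item(0);

// getAttribute() returns an empty string if the attribute does not exist
$href = $anchor !== null ? $anchor->getAttribute('href') : null;

echo $href, "\n";
```

This stays entirely within DOMDocument, which is fine if you only need one or two values and don't mind its more verbose API.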
In the context of your question, here is a tip:
You're using SimpleXML. DOMDocument is a sister library, meaning you can switch between the two, so you don't need to learn a whole new library.
For example, you can use only the HTML parsing feature of DOMDocument and then import the result into SimpleXML. This is useful because SimpleXML does not support HTML parsing.
That works via simplexml_import_dom().
A simplified step-by-step example:
// get the HTML string out of the feed:
$htmlString = $boo->children('content', true)->encoded;
// create DOMDocument for HTML parsing:
$htmlParser = new DOMDocument();
// load the HTML:
$htmlParser->loadHTML($htmlString);
// import it into simplexml:
$html = simplexml_import_dom($htmlParser);
Now you can use $html as a new SimpleXMLElement that represents the HTML document. Because your HTML chunk has no <body> tag of its own, the parser places its content inside a <body> element, in line with the HTML specification. This allows you, for example, to access the href attribute of the first <a> inside the second <p> element in your example:
// access the element you're looking for:
$href = $html->body->p[1]->a['href'];
Here is the full example from above (Online Demo):
// get the HTML string out of the feed:
$htmlString = $boo->children('content', true)->encoded;
// create DOMDocument for HTML parsing:
$htmlParser = new DOMDocument();
// your HTML gives parser warnings, keep them internal:
libxml_use_internal_errors(true);
// load the HTML:
$htmlParser->loadHTML($htmlString);
// import it into simplexml:
$html = simplexml_import_dom($htmlParser);
// access the element you're looking for:
$href = $html->body->p[1]->a['href'];
// output it
echo $href, "\n";
And what it outputs:
laura-bertram-trance-gemini-145-1080.jpg
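If you need more than one link, or a plain string instead of a SimpleXMLElement, a variation along these lines may help. It is only a sketch: the HTML string and the file names in it are invented for illustration, and it uses SimpleXML's xpath() method in place of the property access shown above:

```php
<?php
// Hypothetical HTML chunk with two links in the second paragraph.
$htmlString = '<p>Lorem Ipsum</p><p><a href="first.jpg">one</a> <a href="second.jpg">two</a></p>';

$htmlParser = new DOMDocument();
libxml_use_internal_errors(true);
$htmlParser->loadHTML($htmlString);
$html = simplexml_import_dom($htmlParser);

// attribute access returns a SimpleXMLElement; cast it to get a plain string
$first = (string) $html->body->p[1]->a['href'];

// xpath() returns an array of SimpleXMLElement objects, here one per
// href attribute; strval turns each of them into a string
$all = array_map('strval', $html->xpath('//a/@href'));

echo $first, "\n";
print_r($all);
```

The cast matters once you compare the value or use it as an array key, because without it you are still holding an object.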