How to saveHTML of DOMDocument without HTML wrapper?

All of these answers are now wrong, because as of PHP 5.4 and Libxml 2.6, loadHTML has an $options parameter that tells Libxml how it should parse the content.

Therefore, if we load the HTML with these options

$html->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

when doing saveHTML() there will be no doctype, no <html>, and no <body>.

LIBXML_HTML_NOIMPLIED turns off the automatic adding of implied html/body elements. LIBXML_HTML_NODEFDTD prevents a default doctype from being added when one is not found.

Full documentation of the Libxml parameters is in the PHP manual.

(Note that loadHTML docs say that Libxml 2.6 is needed, but LIBXML_HTML_NODEFDTD is only available in Libxml 2.7.8 and LIBXML_HTML_NOIMPLIED is available in Libxml 2.7.7)
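
As a minimal sketch of the options above: the sample markup and the variable names are illustrative assumptions, and it presumes a Libxml build new enough to define both constants.

```php
<?php
// Illustrative input: a fragment with a single root element.
$content = '<div><p>Hello</p></div>';

$html = new DOMDocument();
// NOIMPLIED: do not add implied <html>/<body>; NODEFDTD: do not add a default doctype.
$html->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// The serialized result should contain only the markup that was loaded.
$result = $html->saveHTML();
echo $result;
```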

Just remove the nodes directly after loading the document with loadHTML():

# remove <!DOCTYPE 
$doc->removeChild($doc->doctype);           

# remove <html><body></body></html> 
$doc->replaceChild($doc->firstChild->firstChild->firstChild, $doc->firstChild);
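
A slightly more defensive sketch of the same idea follows. The sample markup is an assumption, and the null checks and the getElementsByTagName('body') lookup are additions that are not part of the snippet above. Like the original, it keeps only the first child of <body>, so it suits fragments with a single root element.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<div><p>Hello</p></div>'); // illustrative input

// Remove the doctype, if the parser added one.
if ($doc->doctype !== null) {
    $doc->removeChild($doc->doctype);
}

// Replace the <html> root with the first child of the implied <body>.
$body = $doc->getElementsByTagName('body')->item(0);
if ($body !== null && $body->firstChild !== null) {
    $doc->replaceChild($body->firstChild, $doc->documentElement);
}

$result = $doc->saveHTML();
echo $result;
```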

The issue with the top answer is that LIBXML_HTML_NOIMPLIED is unstable.

It can reorder elements (particularly, moving the top element's closing tag to the bottom of the document), add random p tags, and perhaps cause a variety of other issues[1]. It may remove the html and body tags for you, but at the cost of unstable behavior. In production, that's a red flag. In short:

Don't use LIBXML_HTML_NOIMPLIED. Instead, use substr.

Think about it. The <html><body> and </body></html> wrappers have fixed lengths and sit at the two ends of the document: their sizes never change, and neither do their positions. This allows us to use substr to cut them away:

$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NODEFDTD);

echo substr($dom->saveHTML(), 12, -15); // the star of this operation

(THIS IS NOT THE FINAL SOLUTION HOWEVER! See below for the complete answer, keep reading for context)

We cut 12 away from the start of the document because <html><body> = 12 characters (<<>>+html+body = 4+4+4), and we go backwards and cut 15 off the end because \n</body></html> = 15 characters (\n+//+<<>>+body+html = 1 + 2 + 4 + 4 + 4)
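
The arithmetic can be checked directly. In this sketch the serialized string is a hard-coded assumption about what saveHTML() typically returns for a small fragment, not captured parser output.

```php
<?php
// Lengths of the fixed wrappers counted in the text.
$front = strlen('<html><body>');      // 12
$back  = strlen("\n</body></html>");  // 15 (the newline counts as one character)

// Assumed shape of the serialized document for a one-paragraph fragment.
$serialized = "<html><body><p>Hello</p></body></html>\n";

// Cut 12 characters from the front and 15 from the end.
$fragment = substr($serialized, $front, -$back);
echo $fragment;
```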

Notice that I still use LIBXML_HTML_NODEFDTD to keep the !DOCTYPE from being included. First, this simplifies the substr removal of the HTML/BODY tags. Second, we don't remove the doctype with substr because we don't know if the 'default doctype' will always be something of a fixed length. But, most importantly, LIBXML_HTML_NODEFDTD stops the DOM parser from applying a non-HTML5 doctype to the document - which at least prevents the parser from treating elements it doesn't recognize as loose text.

We know for a fact that the HTML/BODY tags are of fixed lengths and positions, and we know that constants like LIBXML_HTML_NODEFDTD are never removed without some type of deprecation notice, so the above method should roll well into the future, BUT...

...the only caveat is that the DOM implementation could change the way in which HTML/BODY tags are placed within the document - for instance, removing the newline at the end of the document, adding spaces between the tags, or adding newlines.

This can be remedied by searching for the positions of the opening and closing body tags, and using those offsets as the lengths to trim off. We use strpos and strrpos to find the offsets from the front and back, respectively:

$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NODEFDTD);
$output = $dom->saveHTML(); // serialize once and reuse the string

$trim_off_front = strpos($output, '<body>') + 6;
// PositionOf<body> + 6 = Cutoff offset after '<body>'
// 6 = Length of '<body>'

$trim_off_end = strrpos($output, '</body>') - strlen($output);
// ^ PositionOf</body> - LengthOfDocument = Relative-negative cutoff offset before '</body>'

echo substr($output, $trim_off_front, $trim_off_end);

In closing, a repeat of the final, future-proof answer:

$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NODEFDTD);
$output = $dom->saveHTML();

$trim_off_front = strpos($output, '<body>') + 6;
$trim_off_end = strrpos($output, '</body>') - strlen($output);

echo substr($output, $trim_off_front, $trim_off_end);

No doctype, no html tag, no body tag. We can only hope the DOM parser will receive a fresh coat of paint soon and we can more directly eliminate these unwanted tags.
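
The offset search could also be wrapped in a small helper. This is only a sketch: the function name bodyInnerHtml and its fallback of returning the untrimmed output when a tag is not found are assumptions, not part of the answer above.

```php
<?php
// Hypothetical helper: returns whatever lies between <body> and the last </body>.
function bodyInnerHtml(string $html): string
{
    $dom = new DOMDocument();
    $dom->loadHTML($html, LIBXML_HTML_NODEFDTD);
    $output = $dom->saveHTML();

    $open  = strpos($output, '<body>');
    $close = strrpos($output, '</body>');
    if ($open === false || $close === false) {
        // Unexpected serialization (e.g. a <body> with attributes): return it untrimmed.
        return $output;
    }

    $start = $open + strlen('<body>');
    return substr($output, $start, $close - $start);
}

$fragment = bodyInnerHtml('<p>Hello</p><p>World</p>'); // illustrative input
echo $fragment;
```

Checking for false matters because strpos and strrpos return false, not -1, when the needle is missing, and false would silently be treated as offset 0 in the arithmetic.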

Use saveXML() instead, and pass it the node you want serialized (the documentElement or, as here, each child of the element whose inner HTML you want).

$innerHTML = '';
foreach ($document->getElementsByTagName('p')->item(0)->childNodes as $child) {
    $innerHTML .= $document->saveXML($child);
}
echo $innerHTML;
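
A self-contained version of the loop might look like this. The sample markup is an assumption, and $document is created here only so the snippet can run on its own.

```php
<?php
$document = new DOMDocument();
$document->loadHTML('<p>Hello <b>world</b></p>'); // illustrative input

// Serialize each child of the first <p> and concatenate the pieces.
$innerHTML = '';
foreach ($document->getElementsByTagName('p')->item(0)->childNodes as $child) {
    $innerHTML .= $document->saveXML($child);
}
echo $innerHTML;
```

Because saveXML() emits XML syntax, void elements such as br are written in self-closing form, which may or may not matter for the consumer of the fragment.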

http://php.net/domdocument.savexml
