PHP: DOMDocument writes hexadecimal entities whenever I try to write UTF-8

When I try to write UTF-8 strings into an XML file using DOMDocument, it writes the hexadecimal notation of the string instead of the string itself.

For example:

&#x5D9;&#x5E8;&#x5D5;&#x5E9;&#x5DC;&#x5D9;&#x5DD;

instead of:

ירושלים

Any ideas how to resolve the issue?


Solution 1:

Ok, here you go:

$dom = new DOMDocument('1.0', 'utf-8');
$dom->appendChild($dom->createElement('root'));
$dom->documentElement->appendChild(new DOMText('ירושלים'));
echo $dom->saveXml();

will work fine, because in this case, the document you constructed will retain the encoding specified as the second argument:

<?xml version="1.0" encoding="utf-8"?>
<root>ירושלים</root>

However, once you load XML that does not specify an encoding into the document, you lose whatever you declared in the constructor, which means that:

$dom = new DOMDocument('1.0', 'utf-8');
$dom->loadXml('<root/>'); // missing prolog
$dom->documentElement->appendChild(new DOMText('ירושלים'));
echo $dom->saveXml();

will not have an encoding of utf-8:

<?xml version="1.0"?>
<root>&#x5D9;&#x5E8;&#x5D5;&#x5E9;&#x5DC;&#x5D9;&#x5DD;</root>

So if you loadXML something, make sure it declares its encoding:

$dom = new DOMDocument();
$dom->loadXml('<?xml version="1.0" encoding="utf-8"?><root/>');
$dom->documentElement->appendChild(new DOMText('ירושלים'));
echo $dom->saveXml();

and it will work as expected.

As an alternative, you can also specify the encoding after loading the document.
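
A minimal sketch of that alternative, assuming the loaded XML has no prolog: the `encoding` property is set on the document after loading and before saving.

```php
<?php
$dom = new DOMDocument();
$dom->loadXML('<root/>');        // no prolog, so no encoding is recorded
$dom->encoding = 'UTF-8';        // declare the output encoding after loading
$dom->documentElement->appendChild($dom->createTextNode('ירושלים'));
$xml = $dom->saveXML();
echo $xml;
```

The saved document should then carry the encoding in its XML declaration and keep the Hebrew letters as literal characters.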

Solution 2:

If you want to output UTF-8 with DOMDocument, you need to specify that. Simple, isn't it? If you already smell a trick question, you're not too far off, but at first sight, it really is straightforward.

Consider the following (UTF-8 encoded) code-example that outputs hexadecimal entities:

$dom = new DOMDocument();
$dom->loadXml('<root>ירושלים</root>');
$dom->save('php://output');

Output:

<?xml version="1.0"?>
<root>&#x5D9;&#x5E8;&#x5D5;&#x5E9;&#x5DC;&#x5D9;&#x5DD;</root>

As written, if you want to output this as UTF-8, you need to specify it, and it is straightforward:

...
$dom->encoding = 'UTF-8';
$dom->save('php://output');

The output then is in UTF-8 explicitly:

<?xml version="1.0" encoding="UTF-8"?>
<root>ירושלים</root>

So much for the straightforward part. If you are interested in the dirty little details, you are free to read on - if not, please do not ask "why?" :).

I wrote "in UTF-8 explicitly" because the output of the first example is UTF-8 encoded as well; the XML just contained hexadecimal entities, which is perfectly valid, even in UTF-8!

You will have noticed that I am nit-picking here, but remember: UTF-8 is the default encoding of XML.
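
To make that concrete, here is a small sketch (the variable names are only illustrative) that parses the same text once written with hexadecimal entities and once with literal characters, then compares what the parser sees:

```php
<?php
$withEntities = new DOMDocument();
$withEntities->loadXML('<root>&#x5D9;&#x5E8;&#x5D5;&#x5E9;&#x5DC;&#x5D9;&#x5DD;</root>');

$withLiterals = new DOMDocument();
$withLiterals->loadXML('<root>ירושלים</root>');

// Once parsed, both documents carry identical character data.
$same = $withEntities->documentElement->textContent
    === $withLiterals->documentElement->textContent;
var_dump($same); // bool(true)
```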

And if you now want to say: Hey wait, if the default encoding is UTF-8 anyway, why does PHP's DOMDocument use the entities in the first place?

Well, the truth is that, contrary to the finding in the question, it does not. Not always.

See the following example, which uses an XML comment instead of a node value to hold the Hebrew letters:

$dom = new DOMDocument();
$dom->loadXml('<root><!-- ירושלים --></root>');
$dom->save('php://output');

Output:

<?xml version="1.0"?>
<root><!-- ירושלים --></root>

Okay, all clear? So the dirty little secret here is: whether you've got those XML entities in there or not makes no difference to the document; it is just a different way of writing the same XML character data. And you probably already feel invited: let's try CDATA instead for the first example:

$dom = new DOMDocument();
$dom->loadXML("<root><![CDATA[ירושלים]]></root>");
$dom->save('php://output');

Output:

<?xml version="1.0"?>
<root><![CDATA[ירושלים]]></root>

As with the XML-comment example before, no XML entities are used here. They could not be used anyway: inside a comment or a CDATA section, a character reference is not recognized as markup and would be taken as literal text.

For an overview, let's create an example that contains all of these:

$dom = new DOMDocument();
$dom->loadXML("<!-- ירושלים --><root>&#x5D9;רושלים <![CDATA[ירושלים]]></root>");
$dom->save('php://output');

Output:

<?xml version="1.0"?>
<!-- ירושלים -->
<root>&#x5D9;&#x5E8;&#x5D5;&#x5E9;&#x5DC;&#x5D9;&#x5DD; <![CDATA[ירושלים]]></root>

Lessons learned:

  • UTF-8 is always used. Entities appear in PCDATA only when the document does not specify UTF-8 as its encoding. If an encoding other than UTF-8 is specified, different rules apply.
  • Loading an XML document as a UTF-8 encoded string into PHP's DOMDocument does not, per se, let you choose whether entities are used for output. Neither libxml flags nor a BOM change that. [1]
  • You can specify that you do not want entities by setting the document's encoding to UTF-8.
  • If you can, manipulate the input string so that it has an XML declaration specifying the document's encoding, as outlined in gordon's answer.
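
As one illustration of the "different rules" in the first point, here is a sketch that sets the encoding to ISO-8859-1. That encoding is chosen here only as an example of one that cannot represent Hebrew letters; the exact form of the references in the output is left to libxml and is not asserted.

```php
<?php
$dom = new DOMDocument();
$dom->loadXML('<root>ירושלים</root>');
$dom->encoding = 'ISO-8859-1';   // an encoding without Hebrew letters
$latin1Xml = $dom->saveXML();

// The serialized output cannot contain the raw UTF-8 Hebrew text ...
$hasRawHebrew = strpos($latin1Xml, 'ירושלים') !== false;

// ... yet parsing the output again recovers the original characters.
$check = new DOMDocument();
$check->loadXML($latin1Xml);
$roundTrips = $check->documentElement->textContent === 'ירושלים';

var_dump($hasRawHebrew, $roundTrips);
```

The point is the same as before: the characters survive as references, only the way they are written changes.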

Tip: If your string has an XML declaration that does not match the string's encoding, or you want to change either of them before loading the string into DOMDocument, you need to change the XML declaration and/or re-encode the string. This has been covered in an answer to the question "PHP XMLReader, get the version and encoding", which shows how the XMLRecoder class works.
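
A rough sketch of that idea follows. It is not the XMLRecoder class: `recodeXmlToUtf8()` is a hypothetical helper written for this example. It assumes the mbstring extension is available, that the caller knows the real source encoding, and that the string starts with an XML declaration.

```php
<?php
// Hypothetical helper: re-encode an XML string to UTF-8 and rewrite the
// encoding in its XML declaration so that the two agree again.
function recodeXmlToUtf8(string $xml, string $fromEncoding): string
{
    $utf8 = mb_convert_encoding($xml, 'UTF-8', $fromEncoding);

    // Touch only the encoding pseudo-attribute of a leading XML declaration.
    return preg_replace(
        '/^(<\?xml\b[^>]*?\bencoding=)(["\'])[^"\']*\2/',
        '${1}${2}UTF-8${2}',
        $utf8,
        1
    );
}

// "café" in ISO-8859-1; \xE9 is the single byte for "é" in that encoding.
$latin1 = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><root>caf\xE9</root>";

$recoded = new DOMDocument();
$recoded->loadXML(recodeXmlToUtf8($latin1, 'ISO-8859-1'));
echo $recoded->saveXML();
```

Because the rewritten declaration now says UTF-8, the loaded document keeps that encoding and saves non-ASCII characters without entities.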

And that's it, hopefully.


[1] Probably it works if you load from an HTTP request, provide a stream context and flag the character encoding via metadata, but this should be tested first; I do not know. That the BOM does not work is somewhat a sign that none of these things work.

Solution 3:

Apparently passing the documentElement as $node to saveXML works around this, although I can't say I understand why.

e.g.

$dom->saveXML($dom->documentElement);

rather than:

$dom->saveXML();

Source: http://www.php.net/manual/en/domdocument.savexml.php#88525
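
A short sketch comparing the two calls on a document loaded without an encoding. Note that saving a single node also leaves out the XML declaration, so this suits fragments better than complete files.

```php
<?php
$dom = new DOMDocument();
$dom->loadXML('<root>ירושלים</root>');   // no encoding declared

$whole    = $dom->saveXML();                        // whole document, with XML declaration
$rootOnly = $dom->saveXML($dom->documentElement);   // only the given node, no declaration

echo $whole, "\n", $rootOnly, "\n";
```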

Solution 4:

The to-the-point answer is:

When your function starts, right after you get the content, do this:

$content = mb_convert_encoding($content, 'HTML-ENTITIES', 'UTF-8');

And then start the new document etc. Take this as an example:

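// Nothing to parse.
if ( empty( $content ) ) {
    return false;
}
$doc = new DOMDocument('1.0', 'utf-8');
// Collect libxml parse warnings internally instead of emitting them.
libxml_use_internal_errors(true);
// Parse as a fragment: no implied <html>/<body> wrappers, no default doctype.
$doc->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);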

Then do whatever you were intending to do with your code.
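
Note that passing 'HTML-ENTITIES' to mb_convert_encoding() is deprecated as of PHP 8.2. Below is a sketch of the same steps using mb_encode_numericentity() instead. The function name `loadUtf8HtmlFragment()` and the sample markup are made up for this example, and the mbstring extension is assumed to be available.

```php
<?php
// Hypothetical helper: wraps the steps above in one function.
function loadUtf8HtmlFragment(string $content)
{
    if ($content === '') {
        return false;
    }

    // Turn every non-ASCII character into a numeric entity so that
    // loadHTML() does not have to guess the input encoding.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $content = mb_encode_numericentity($content, $map, 'UTF-8');

    $doc = new DOMDocument('1.0', 'utf-8');
    libxml_use_internal_errors(true); // collect parser warnings internally
    $doc->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    return $doc;
}

$doc = loadUtf8HtmlFragment('<p>ירושלים</p>');
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
```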