[2014-07-31 14:35 UTC] villascape at gmail dot com
Description:
------------
DOMDocument::loadHTML() and DOMDocument::saveHTML() do not round-trip consistently for some input strings, specifically a body containing a non-breaking space: '<body>&nbsp;</body>'.
Note that I am using PHP version 5.5.14-1.ius.centos6.x86_64, and not 5.5.15.
Test script:
---------------
<?php
$str1='<body>&nbsp;</body>';
echo('Initial<pre>'.htmlspecialchars($str1).'</pre>'); //<body>&nbsp;</body>
$dom = new DOMDocument(); //Default is UTF-8, but iso-8859-1 is available if required
$dom->loadHTML($str1);
$xpath = new DOMXPath($dom);
$body = $dom->getElementsByTagName('body')->item(0);
$str2=$dom->saveHTML($body);
echo('First Option 1<pre>'.htmlspecialchars($str2).'</pre>'); //<body> </body>
$dom->loadHTML($str2);
$xpath = new DOMXPath($dom);
$body = $dom->getElementsByTagName('body')->item(0);
$str3=$dom->saveHTML($body);
echo('Second Option 1<pre>'.htmlspecialchars($str3).'</pre>'); //<body>Â </body>
?>
Expected result:
----------------
Initial String
<body> </body>
Returned String Pass 1
<body> </body>
Returned String Pass 2
<body> </body>
Actual result:
--------------
Initial String
<body> </body>
Returned String Pass 1
<body> </body>
Returned String Pass 2
<body>Â </body>
Copyright © 2001-2025 The PHP Group. All rights reserved.
Last updated: Tue Nov 04 20:00:01 2025 UTC
This is still a problem in PHP 7.2. Non-breaking space characters become garbled by both the saveHTML and the saveXML functions. Strangely, this depends on how the character is represented in the input string fed into loadHTML. Sending it in as a Unicode non-breaking space character triggers the bug, but sending it in as &nbsp; works fine. The trouble is that saveHTML emits it as the Unicode character, so a round trip through DOMDocument and back again will always result in garbled output.

Here is a PHPUnit test to demonstrate the problem:
----------
class myTest extends \PHPUnit\Framework\TestCase
{
    public function testNbsp()
    {
        $runThrough = function($html) {
            $doc = new DOMDocument();
            $doc->loadHTML("<html><body><div id=\"target\">$html</div></body></html>");
            $newHtml = '';
            foreach ($doc->getElementById('target')->childNodes as $node) {
                $newHtml .= $node->ownerDocument->saveXML($node);
            }
            return $newHtml;
        };
        $original = '<p>Hello '."\xc2\xa0".' world</p>';
        $pass1 = $runThrough($original);
        $pass2 = $runThrough($pass1);
        $this->assertEquals($pass1, $pass2);
    }
}
----------
Note that if we were to extend the test with more iterations, each runThrough would add another extra character to the output.

> Default is UTF-8, but iso-8859-1 is available if required

That is not true, at least not when a document is loaded. When libxml2 begins parsing a document, it tries to detect the encoding by checking the byte values of the first characters of the XML declaration. That works well for detecting any of the supported Unicode encodings for XML documents, but it misdetects e.g. ISO-8859-*, and can't work for HTML at all. Anyway, if no encoding could be detected this way, libxml2 falls back to checking for a BOM, and if that fails, the encoding is unspecified. The actual encoding may later be determined from the XML declaration's encoding attribute, or from the respective meta elements of the HTML. In this case, no such hint is available, so the text node is read as a single-byte encoding.
Explicitly specifying UTF-8 as the default when calling libxml2 would be possible, but it would be a BC break, and I'm not even sure whether libxml2 would override this when it finds an encoding specification/hint in the document later, so we cannot fix the behavior. Instead, we should document it. As a workaround, you can prepend a BOM to signal the desired encoding, i.e. $dom->loadHTML("\xEF\xBB\xBF$str2");
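
The BOM workaround above can be sketched as a small round-trip check. This is only an illustration under stated assumptions: roundTripBody() is a hypothetical helper name, not something from the report, and it assumes the dom extension is loaded and that the input really is UTF-8.

```php
<?php
// Hypothetical helper (not part of the original report): parse an HTML
// string as UTF-8 by prepending a UTF-8 BOM, then serialize <body> again.
function roundTripBody(string $html): string
{
    $dom = new DOMDocument();
    // The BOM signals to libxml2 that the following bytes are UTF-8.
    $dom->loadHTML("\xEF\xBB\xBF" . $html);
    $body = $dom->getElementsByTagName('body')->item(0);
    return $dom->saveHTML($body);
}

$nbsp  = "\xC2\xA0"; // U+00A0 NO-BREAK SPACE encoded as UTF-8
$pass1 = roundTripBody("<body>{$nbsp}</body>");
$pass2 = roundTripBody($pass1);

// With the workaround in place, the second pass is expected to match the first.
var_dump($pass1 === $pass2);
```

The BOM has to be prepended on every load, because the string returned by saveHTML() does not carry it.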