PHP :: Doc Bug #67727 :: Inconsistent loading and saving of DOMDocument

Submitted:

2014-07-31 14:35 UTC

Modified:

2021-03-16 11:25 UTC

Votes:	12
Avg. Score:	3.8 ± 1.0
Reproduced:	11 of 12 (91.7%)
Same Version:	4 (36.4%)
Same OS:	0 (0.0%)

From:

villascape at gmail dot com

Assigned:

Status:

Verified

Package:

DOM XML related

PHP Version:

5.5.15

OS:

Centos 6.5

Private report:

CVE-ID:

None

[2014-07-31 14:35 UTC] villascape at gmail dot com

Description:
------------
DOMDocument::loadHTML() and DOMDocument::saveHTML() are not consistent for some input strings, specifically '<body>&nbsp;</body>'.

Note that I am using PHP version 5.5.14-1.ius.centos6.x86_64, and not 5.5.15.

Test script:
---------------
<?php
$str1='<body>&nbsp;</body>';
echo('Initial<pre>'.htmlspecialchars($str1).'</pre>'); //<body>&nbsp;</body>

$dom = new DOMDocument();   //Default is UTF-8, but iso-8859-1 is available if required

$dom->loadHTML($str1);
$xpath = new DOMXPath($dom);
$body = $dom->getElementsByTagName('body')->item(0);
$str2=$dom->saveHTML($body);
echo('First Option 1<pre>'.htmlspecialchars($str2).'</pre>'); //<body> </body>

$dom->loadHTML($str2);
$xpath = new DOMXPath($dom);
$body = $dom->getElementsByTagName('body')->item(0);
$str3=$dom->saveHTML($body);
echo('Second Option 1<pre>'.htmlspecialchars($str3).'</pre>'); //<body>Â </body>
?>

Expected result:
----------------
Initial String
<body>&nbsp;</body>
Returned String Pass 1
<body>&nbsp;</body>
Returned String Pass 2
<body>&nbsp;</body>


Actual result:
--------------
Initial String
<body>&nbsp;</body>
Returned String Pass 1
<body> </body>
Returned String Pass 2
<body>Â </body>

[2019-12-03 18:01 UTC] coleman at civicrm dot org

This is still a problem in PHP 7.2. Non-breaking space characters are garbled by both saveHTML and saveXML. Strangely, this depends on how the character is represented in the input string passed to loadHTML: passing it as a literal Unicode non-breaking space character triggers the bug, but passing it as &nbsp; works fine. The trouble is that saveHTML emits it as the Unicode character, so a round-trip through DOMDocument and back again will always result in garbled output. Here is a PHPUnit test to demonstrate the problem:

----------

class myTest extends \PHPUnit\Framework\TestCase {
  public function testNbsp() {
    $runThrough = function($html) {
      $doc = new DOMDocument();
      $doc->loadHTML("<html><body><div id=\"target\">$html</div></body></html>");

      $newHtml = '';
      foreach ($doc->getElementById('target')->childNodes as $node) {
        $newHtml .= $node->ownerDocument->saveXML($node);
      }
      return $newHtml;
    };

    $original = '<p>Hello '."\xc2\xa0".' world</p>';

    $pass1 = $runThrough($original);
    $pass2 = $runThrough($pass1);

    $this->assertEquals($pass1, $pass2);
  }
}

----------

Note that if we were to extend the test with more iterations, each runThrough will add another extra character to the output.

[2021-03-12 16:32 UTC] cmb@php.net

-Status: Open
+Status: Verified
-Type: Bug
+Type: Documentation Problem
-Assigned To:
+Assigned To: cmb

[2021-03-12 16:32 UTC] cmb@php.net

> Default is UTF-8, but iso-8859-1 is available if required

That is not true, at least not when a document is loaded.  When
libxml2 begins parsing a document, it tries to detect the encoding
by checking the byte values of the first characters of the
XML declaration.  That works well to detect any of the supported
Unicode encodings for XML documents, but e.g. misdetects
ISO-8859-*, and can't work for HTML at all.  Anyway, if no
encoding could be detected this way, libxml2 falls back to
checking for a BOM, and if that fails, the encoding is
unspecified.  The actual encoding may later be determined from the
XML declaration's encoding attribute, or from the respective meta
elements of the HTML.  In this case, neither is available, so the
text node is read as a single-byte encoding.
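
To illustrate the effect on the bytes, here is a minimal sketch that
imitates the misreading outside of DOM.  It assumes the single-byte
encoding behaves like ISO-8859-1 for these bytes, and it uses iconv()
only as a stand-in for what the parser does internally; it is not the
code path libxml2 actually takes.

```php
<?php
// U+00A0 (non-breaking space) encoded as UTF-8 is the two bytes C2 A0.
$nbsp = "\xC2\xA0";

// Assumption: the undetected encoding acts like ISO-8859-1, so each byte
// is taken as one character and then re-encoded as UTF-8.
$garbled = iconv('ISO-8859-1', 'UTF-8', $nbsp);

echo bin2hex($nbsp), "\n";    // c2a0
echo bin2hex($garbled), "\n"; // c382c2a0: "Â" (U+00C2) followed by U+00A0
```

This matches the "Â" that shows up in front of the space in the
second pass of the original test script.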

Explicitly specifying UTF-8 as default when calling libxml2 would
be possible, but would be a BC break, and I'm not even sure
whether libxml2 would override this when it finds an encoding
specification/hint in the document later, so we cannot fix the
behavior.  Instead we should document it.

As a workaround, you can prepend a BOM to signal the desired
encoding, i.e.

    $dom->loadHTML("\xEF\xBB\xBF$str2");
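
A slightly fuller sketch of the same BOM workaround follows.  The
helper name loadUtf8Html() is hypothetical (it is not part of the DOM
extension), and the sample markup is only an assumption for
illustration.

```php
<?php
// Hypothetical helper: load a UTF-8 HTML string, prepending a UTF-8 BOM
// so that the encoding is signalled to the parser.
function loadUtf8Html(DOMDocument $dom, string $html): bool
{
    $bom = "\xEF\xBB\xBF";
    // Do not add a second BOM if the input already starts with one.
    if (strncmp($html, $bom, 3) !== 0) {
        $html = $bom . $html;
    }
    return $dom->loadHTML($html);
}

$dom = new DOMDocument();
$loaded = loadUtf8Html($dom, "<body><p>Hello\xC2\xA0world</p></body>");

// Inspect the raw bytes of the text node; with the BOM in place the
// non-breaking space should still be the two bytes C2 A0.
$p = $dom->getElementsByTagName('p')->item(0);
echo bin2hex($p->textContent), "\n";
```

Dumping the bytes with bin2hex() avoids any confusion caused by the
terminal or browser re-interpreting the output.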

[2021-03-16 11:25 UTC] cmb@php.net

-Assigned To: cmb
+Assigned To:
