Doc Bug #67727 Inconsistent loading and saving of DOMDocument
Submitted: 2014-07-31 14:35 UTC Modified: 2021-03-16 11:25 UTC
Votes:12
Avg. Score:3.8 ± 1.0
Reproduced:11 of 12 (91.7%)
Same Version:4 (36.4%)
Same OS:0 (0.0%)
From: villascape at gmail dot com Assigned:
Status: Verified Package: DOM XML related
PHP Version: 5.5.15 OS: Centos 6.5
Private report: No CVE-ID: None
 [2014-07-31 14:35 UTC] villascape at gmail dot com
Description:
------------
DOMDocument::loadHTML() and DOMDocument::saveHTML() are not consistent for some input strings, specifically '<body>&nbsp;</body>'.

Note that I am using PHP version 5.5.14-1.ius.centos6.x86_64, and not 5.5.15.

Test script:
---------------
<?php
$str1='<body>&nbsp;</body>';
echo('Initial<pre>'.htmlspecialchars($str1).'</pre>'); //<body>&nbsp;</body>

$dom = new DOMDocument();   //Default is UTF-8, but iso-8859-1 is available if required

$dom->loadHTML($str1);
$xpath = new DOMXPath($dom);
$body = $dom->getElementsByTagName('body')->item(0);
$str2=$dom->saveHTML($body);
echo('First Option 1<pre>'.htmlspecialchars($str2).'</pre>'); //<body> </body>

$dom->loadHTML($str2);
$xpath = new DOMXPath($dom);
$body = $dom->getElementsByTagName('body')->item(0);
$str3=$dom->saveHTML($body);
echo('Second Option 1<pre>'.htmlspecialchars($str3).'</pre>'); //<body>Â </body>
?>

Expected result:
----------------
Initial String
<body>&nbsp;</body>
Returned String Pass 1
<body>&nbsp;</body>
Returned String Pass 2
<body>&nbsp;</body>


Actual result:
--------------
Initial String
<body>&nbsp;</body>
Returned String Pass 1
<body> </body>
Returned String Pass 2
<body>Â </body>


 [2019-12-03 18:01 UTC] coleman at civicrm dot org
This is still a problem in PHP 7.2. Non-breaking space characters are garbled by both saveHTML and saveXML. Strangely, this depends on how the character is represented in the input string passed to loadHTML: a literal Unicode non-breaking space character triggers the bug, but the &nbsp; entity works fine. The trouble is that saveHTML emits the literal Unicode character, so a round trip through DOMDocument and back again will always result in garbled output. Here is a PHPUnit test to demonstrate the problem:

----------

class myTest extends \PHPUnit\Framework\TestCase {
  public function testNbsp() {
    $runThrough = function($html) {
      $doc = new DOMDocument();
      $doc->loadHTML("<html><body><div id=\"target\">$html</div></body></html>");

      $newHtml = '';
      foreach ($doc->getElementById('target')->childNodes as $node) {
        $newHtml .= $node->ownerDocument->saveXML($node);
      }
      return $newHtml;
    };

    $original = '<p>Hello '."\xc2\xa0".' world</p>';

    $pass1 = $runThrough($original);
    $pass2 = $runThrough($pass1);

    $this->assertEquals($pass1, $pass2);
  }
}

----------

Note that if the test were extended with more iterations, each call to runThrough would add another extra character to the output.
 [2021-03-12 16:32 UTC] cmb@php.net
-Status: Open
+Status: Verified
-Type: Bug
+Type: Documentation Problem
-Assigned To:
+Assigned To: cmb
 [2021-03-12 16:32 UTC] cmb@php.net
> Default is UTF-8, but iso-8859-1 is available if required

That is not true, at least not when a document is loaded.  When
libxml2 begins parsing a document, it tries to detect the encoding
by checking the byte values of the first characters of the
XML declaration.  That works well to detect any of the supported
Unicode encodings for XML documents, but e.g. misdetects
ISO-8859-*, and can't work for HTML at all.  Anyway, if no
encoding could be detected this way, libxml2 falls back to
checking for a BOM, and if that fails, the encoding is
unspecified.  The actual encoding may later be determined from the
XML declaration's encoding attribute, or from the respective meta
elements of the HTML.  In this case, this is not available, so the
text node is read in a single-byte encoding.
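
The "Â" in the second pass of the report is consistent with this kind of misreading. The following is a minimal sketch, assuming the single-byte encoding behaves like ISO-8859-1 for these bytes; `latin1ToUtf8` is a hypothetical helper written for illustration and is not part of PHP or libxml2. It shows the two UTF-8 bytes of a non-breaking space being treated as two separate characters and re-encoded:

```php
<?php
// Hypothetical helper (not a PHP or libxml2 function): treat every byte of
// $bytes as one ISO-8859-1 character and re-encode that character as UTF-8.
function latin1ToUtf8(string $bytes): string
{
    $out = '';
    foreach (str_split($bytes) as $char) {
        $code = ord($char);
        if ($code < 0x80) {
            // ASCII bytes are identical in both encodings.
            $out .= $char;
        } else {
            // U+0080..U+00FF become a two-byte UTF-8 sequence.
            $out .= chr(0xC0 | ($code >> 6)) . chr(0x80 | ($code & 0x3F));
        }
    }
    return $out;
}

$nbsp = "\xC2\xA0";               // U+00A0 encoded as UTF-8
$misread = latin1ToUtf8($nbsp);   // each byte misread as its own character

echo bin2hex($nbsp), ' -> ', bin2hex($misread), PHP_EOL; // c2a0 -> c382c2a0
```

The bytes `c3 82` are the UTF-8 encoding of "Â" (U+00C2), and the trailing `c2 a0` is again a non-breaking space, which matches the `Â ` sequence shown in the actual result above.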

Explicitly specifying UTF-8 as default when calling libxml2 would
be possible, but would be a BC break, and I'm not even sure
whether libxml2 would override this when it finds an encoding
specification/hint in the document later, so we cannot fix the
behavior.  Instead we should document it.

As a workaround, you can prepend a BOM to signal the desired
encoding, i.e.

    $dom->loadHTML("\xEF\xBB\xBF$str2");
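
Applied to the round trip from the original test script, the workaround might look like the sketch below. It assumes the dom extension is loaded and that the input really is UTF-8; `roundTripBody` is a hypothetical helper name introduced here, not an existing API.

```php
<?php
// Hypothetical helper: load an HTML fragment as UTF-8 by prepending a BOM,
// then serialize the <body> element again.
function roundTripBody(string $html): string
{
    $dom = new DOMDocument();
    // The UTF-8 BOM (EF BB BF) signals the encoding to libxml2.
    $dom->loadHTML("\xEF\xBB\xBF" . $html);
    $body = $dom->getElementsByTagName('body')->item(0);
    return $dom->saveHTML($body);
}

$pass1 = roundTripBody('<body>&nbsp;</body>');
$pass2 = roundTripBody($pass1);

// Compare the raw bytes of both passes to check that nothing was added.
echo bin2hex($pass1), PHP_EOL;
echo bin2hex($pass2), PHP_EOL;
```

The BOM is only correct when the string being loaded is actually UTF-8; input in another encoding would need a different hint.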
 [2021-03-16 11:25 UTC] cmb@php.net
-Assigned To: cmb
+Assigned To:
 