Bug #67727 Inconsistent loading and saving of DOMDocument
Submitted: 2014-07-31 14:35 UTC Modified: -
Votes:12
Avg. Score:3.8 ± 1.0
Reproduced:11 of 12 (91.7%)
Same Version:4 (36.4%)
Same OS:0 (0.0%)
From: villascape at gmail dot com Assigned:
Status: Open Package: DOM XML related
PHP Version: 5.5.15 OS: Centos 6.5
Private report: No CVE-ID: None

 [2014-07-31 14:35 UTC] villascape at gmail dot com
Description:
------------
DOMDocument::loadHTML() and DOMDocument::saveHTML() do not round-trip consistently for some input strings, specifically '<body>&nbsp;</body>'.

Note that I am using PHP version 5.5.14-1.ius.centos6.x86_64, and not 5.5.15.

Test script:
---------------
<?php
$str1 = '<body>&nbsp;</body>';
echo('Initial String<pre>'.htmlspecialchars($str1).'</pre>'); // <body>&nbsp;</body>

// No encoding is passed to the constructor, and the input carries no charset declaration.
$dom = new DOMDocument();

// Pass 1: load the original string and serialize the body element.
$dom->loadHTML($str1);
$body = $dom->getElementsByTagName('body')->item(0);
$str2 = $dom->saveHTML($body);
echo('Returned String Pass 1<pre>'.htmlspecialchars($str2).'</pre>'); // <body> </body>

// Pass 2: feed the output of pass 1 back in.
$dom->loadHTML($str2);
$body = $dom->getElementsByTagName('body')->item(0);
$str3 = $dom->saveHTML($body);
echo('Returned String Pass 2<pre>'.htmlspecialchars($str3).'</pre>'); // <body>Â </body>
?>

Expected result:
----------------
Initial String
<body>&nbsp;</body>
Returned String Pass 1
<body>&nbsp;</body>
Returned String Pass 2
<body>&nbsp;</body>


Actual result:
--------------
Initial String
<body>&nbsp;</body>
Returned String Pass 1
<body> </body>
Returned String Pass 2
<body>Â </body>
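
One plausible reading of the stray "Â" in pass 2, offered as an assumption rather than a confirmed diagnosis: the UTF-8 bytes of a non-breaking space (0xC2 0xA0) are read back as two ISO-8859-1 characters. The sketch below does not touch DOMDocument at all; it only reproduces that byte-level misreading. The helper name latin1ToUtf8 is hypothetical.

```php
<?php
// Hypothetical helper (not part of PHP): re-encode a byte string as UTF-8
// while treating every input byte as one ISO-8859-1 character.
function latin1ToUtf8(string $bytes): string
{
    $out = '';
    for ($i = 0, $n = strlen($bytes); $i < $n; $i++) {
        $b = ord($bytes[$i]);
        // Bytes below 0x80 are plain ASCII; the rest become two-byte UTF-8 sequences.
        $out .= $b < 0x80
            ? chr($b)
            : chr(0xC0 | ($b >> 6)) . chr(0x80 | ($b & 0x3F));
    }
    return $out;
}

$nbspUtf8 = "\xC2\xA0";              // U+00A0 encoded as UTF-8
$misread  = latin1ToUtf8($nbspUtf8); // result of reading those bytes as Latin-1

echo bin2hex($nbspUtf8), "\n"; // c2a0
echo bin2hex($misread), "\n";  // c382c2a0, i.e. "Â" followed by a non-breaking space
```

If this assumption holds, the single character from pass 1 turns into the two characters seen in pass 2 because the second loadHTML() call receives UTF-8 bytes with no charset declaration.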


 [2019-12-03 18:01 UTC] coleman at civicrm dot org
This is still a problem in PHP 7.2. Non-breaking space characters are garbled by both saveHTML() and saveXML(). Strangely, the outcome depends on how the non-breaking space is represented in the string passed to loadHTML(): a literal Unicode non-breaking space character (U+00A0, the UTF-8 bytes 0xC2 0xA0) triggers the bug, while the entity &nbsp; works fine. The trouble is that saveHTML() emits the literal Unicode character, so a round trip through DOMDocument and back again will always result in garbled output. Here is a PHPUnit test that demonstrates the problem:

----------

class myTest extends \PHPUnit\Framework\TestCase {
  public function testNbsp() {
    $runThrough = function($html) {
      $doc = new DOMDocument();
      $doc->loadHTML("<html><body><div id=\"target\">$html</div></body></html>");

      $newHtml = '';
      foreach ($doc->getElementById('target')->childNodes as $node) {
        $newHtml .= $node->ownerDocument->saveXML($node);
      }
      return $newHtml;
    };

    $original = '<p>Hello '."\xc2\xa0".' world</p>';

    $pass1 = $runThrough($original);
    $pass2 = $runThrough($pass1);

    $this->assertEquals($pass1, $pass2);
  }
}

----------

Note that if the test were extended with more iterations, each runThrough call would add another extra character to the output.
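
A possible workaround, sketched under the assumption that the garbling comes from loadHTML() receiving UTF-8 bytes without any charset declaration: wrap the fragment in a document whose head declares UTF-8 via a meta element. The helper name roundTripUtf8 is hypothetical, and the behaviour may vary with the libxml2 version PHP is built against.

```php
<?php
// Hypothetical helper: round-trip an HTML fragment through DOMDocument,
// declaring the input encoding explicitly with a meta element.
function roundTripUtf8(string $fragment): string
{
    $doc = new DOMDocument();
    $doc->loadHTML(
        '<html><head>'
        . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        . '</head><body><div id="target">' . $fragment . '</div></body></html>'
    );

    // Serialize each child of the wrapper div, as in the PHPUnit test above.
    $out = '';
    foreach ($doc->getElementById('target')->childNodes as $node) {
        $out .= $doc->saveXML($node);
    }
    return $out;
}

$original = "<p>Hello \xC2\xA0 world</p>"; // literal UTF-8 non-breaking space
$pass1 = roundTripUtf8($original);
$pass2 = roundTripUtf8($pass1);

// With the encoding declared, a second pass should leave the output unchanged.
var_dump($pass1 === $pass2);
```

This only sidesteps the inconsistency for callers who control the wrapper markup; it does not change what loadHTML() does with undeclared input.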
 
Last updated: Mon Jan 20 15:01:25 2020 UTC