Bug #80665 DOMDocument object corruption during cloning
Submitted: 2021-01-25 10:18 UTC Modified: 2021-01-27 19:57 UTC
From: andrey at email dot dp dot ua Assigned:
Status: Open Package: DOM XML related
PHP Version: Irrelevant OS: Debian Linux
Private report: No CVE-ID: None
 [2021-01-25 10:18 UTC] andrey at email dot dp dot ua
Description:
------------
When a DOMDocument is cloned, some of its properties are not copied correctly.

The saveHTML method of the cloned object returns different results from the same method of the original object.

Calling saveHTML on the cloned object with documentElement as its argument returns output in which some characters are converted to numeric character references, whereas saveHTML without arguments returns the correct result.


Properties that are corrupted during cloning
---------------------------------------------
$DOMDocument->nodeType
If the original object has nodeType XML_HTML_DOCUMENT_NODE, the clone has XML_DOCUMENT_NODE.

$DOMDocument->baseURI
The value is lost during cloning.

$DOMDocument->version
If not set on the original object, it is set to 1.0 on the clone.

$DOMDocument->xmlVersion
If not set on the original object, it is set to 1.0 on the clone.



Methods that return different results on the cloned object
-----------------------------------------------------------
A carriage return in the original document is returned correctly by $DOMDocument->saveHTML(), but it is replaced with &#13; when $DOMDocument->saveHTML($DOMDocument->documentElement) is called on the cloned object.

Test script:
---------------
<?php

$html = "<html><head><base href='https://php.net'></head><body>\r</body></html>";


$dom = new DOMDocument();
$dom->loadHTML($html);


$arr = array(
             'DOMDocument'         => $dom,
             'cloned by clone'     => clone $dom,
             'cloned by cloneNode' => $dom->cloneNode(true)
            );

foreach ($arr as $descr=>$obj)
{
    echo $descr.":\n";
    echo "--------------------------\n";

    echo "saveHTML:\n";
    echo $obj->saveHTML()."\n\n";

    echo "saveHTML via DOMDocument::documentElement:\n";
    echo $obj->saveHTML($obj->documentElement)."\n\n";

    echo "\$DOMDocument->nodeType   = ".$obj->nodeType."\n";
    echo "\$DOMDocument->baseURI    = ".$obj->baseURI."\n";
    echo "\$DOMDocument->version    = ".$obj->version."\n";
    echo "\$DOMDocument->xmlVersion = ".$obj->xmlVersion."\n\n\n";
}

Expected result:
----------------
[   three times   ]

saveHTML:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><base href="https://php.net"></head><body>
</body></html>


saveHTML(documentElement):
<html><head><base href="https://php.net"></head><body>
</body></html>

$DOMDocument->nodeType   = 13
$DOMDocument->baseURI    = https://php.net
$DOMDocument->version    =
$DOMDocument->xmlVersion =


Actual result:
--------------
DOMDocument:
--------------------------
saveHTML:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><base href="https://php.net"></head><body>
</body></html>


saveHTML(documentElement):
<html><head><base href="https://php.net"></head><body>
</body></html>

$DOMDocument->nodeType   = 13
$DOMDocument->baseURI    = https://php.net
$DOMDocument->version    = 
$DOMDocument->xmlVersion = 



cloned by clone:
--------------------------
saveHTML:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><base href="https://php.net"></head><body>
</body></html>


saveHTML(documentElement):
<html><head><base href="https://php.net"></head><body>&#13;</body></html>

$DOMDocument->nodeType   = 9
$DOMDocument->baseURI    = 
$DOMDocument->version    = 1.0
$DOMDocument->xmlVersion = 1.0



cloned by cloneNode:
--------------------------
saveHTML:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><base href="https://php.net"></head><body>
</body></html>


saveHTML(documentElement):
<html><head><base href="https://php.net"></head><body>&#13;</body></html>

$DOMDocument->nodeType   = 9
$DOMDocument->baseURI    = 
$DOMDocument->version    = 1.0
$DOMDocument->xmlVersion = 1.0


 [2021-01-26 07:11 UTC] glash dot gnome at gmail dot com
Hello,

Thank you for reporting this issue. You make PHP better.

You found a libxml2 bug.

I fixed it and requested a merge:
https://gitlab.gnome.org/GNOME/libxml2/-/merge_requests/99

-----------------------------------------------------------------

One question remains. 

Should $document->xmlVersion be populated after loadHTML() / loadXML() when the version is not declared in the document? Or should it stay undeclared?
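
For anyone who wants to check what xmlVersion currently holds in each case, a minimal sketch follows. reportVersion() is a hypothetical helper introduced only for this illustration. It is not part of ext/dom, and no particular output is claimed here because the values depend on the libxml2 build in use.

```php
<?php
// Sketch only: shows what xmlVersion holds after each kind of load.
// reportVersion() is a hypothetical helper, not part of ext/dom.
function reportVersion(string $label, DOMDocument $doc): string
{
    // var_export() makes NULL and empty strings distinguishable.
    return $label . ': ' . var_export($doc->xmlVersion, true);
}

// XML with an explicit declaration.
$declared = new DOMDocument();
$declared->loadXML('<?xml version="1.0"?><root/>');

// XML without any declaration.
$undeclared = new DOMDocument();
$undeclared->loadXML('<root/>');

// HTML, where an XML version has no obvious meaning.
$html = new DOMDocument();
$html->loadHTML('<html><body></body></html>');

echo reportVersion('loadXML, declared', $declared), PHP_EOL;
echo reportVersion('loadXML, undeclared', $undeclared), PHP_EOL;
echo reportVersion('loadHTML', $html), PHP_EOL;
```

Running the same three calls against a clone of each document makes it easy to see whether the clone still differs from the original.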
 [2021-01-27 19:57 UTC] andrey at email dot dp dot ua
Thank you for your work.


About your question... It's a good question.

As for me, the answer is "No". Let me explain.

My main expectation of any solution is predictability. A solution should do nothing that the user did not directly request. As a result, a document should survive a load and save round trip unchanged, any property value read back should equal the value that was written, and so on.

So the main question is not the one you ask but this one: "What is a DOMDocument after a document has been loaded by loadXML?"

If the answer is "a DOMDocument is a DOMDocument, the loaded document is stored somewhere inside it, and the two entities do not depend on each other", then I have no idea what to expect from DOMDocument or what task it can be used for at all.

But if the answer is "after a document has been loaded, the DOMDocument is that document's DOMDocument", then ANY value in the DOMDocument SHOULD be the same as in the loaded document. If a value is absent from the document, it should be NULL in the DOMDocument.



And let's develop the thought further...


Even if the DOMDocument was created with an XML version (and encoding), after a document is loaded these values should be overwritten with the actual values from the document. (Actually they shouldn't. Please read below.)

You could say that in this case we lose the XML version and encoding set by the user, and that this is the same problem seen from the other side. But it is not. It looks like the same problem, but it is a different one.

The problem lies in the wrong definitions used by DOMDocument.

The XML version set in the constructor is not the XML version of the document. It is (or should be) a fallback XML version, used if no XML version is set inside the document.

The encoding is not the document's encoding. It is (or should be) a fallback encoding, used if no encoding is set inside the document.

These values should be stored separately from the document's actual XML version and encoding. At any time the user should be able to read the document's encoding and the fallback encoding separately, and get both the document's actual values and the corresponding fallback values.
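
To make the idea concrete, here is a rough sketch of such a separation done in userland. Everything in it is an assumption for illustration: the class DocumentWithFallback and its methods are hypothetical and do not exist in ext/dom, and the regular expression is a naive way to read the XML declaration, not a full parser.

```php
<?php
// Hypothetical wrapper, NOT part of ext/dom. It keeps the fallback values
// apart from what the loaded document itself declares.
final class DocumentWithFallback
{
    private DOMDocument $doc;
    private string $fallbackVersion;
    private string $fallbackEncoding;
    private ?string $declaredVersion = null;
    private ?string $declaredEncoding = null;

    public function __construct(string $fallbackVersion = '1.0', string $fallbackEncoding = 'UTF-8')
    {
        $this->fallbackVersion  = $fallbackVersion;
        $this->fallbackEncoding = $fallbackEncoding;
        $this->doc = new DOMDocument($fallbackVersion, $fallbackEncoding);
    }

    public function load(string $xml): bool
    {
        // Naive assumption: the declaration, if present, starts the string.
        $pattern = '/^<\?xml\s+version\s*=\s*(["\'])([^"\']+)\1'
                 . '(?:\s+encoding\s*=\s*(["\'])([^"\']+)\3)?/';
        $m = [];
        preg_match($pattern, $xml, $m);
        // NULL means "not declared in the document".
        $this->declaredVersion  = $m[2] ?? null;
        $this->declaredEncoding = $m[4] ?? null;

        return (bool) $this->doc->loadXML($xml);
    }

    public function declaredVersion(): ?string  { return $this->declaredVersion; }
    public function declaredEncoding(): ?string { return $this->declaredEncoding; }
    public function fallbackVersion(): string   { return $this->fallbackVersion; }
    public function fallbackEncoding(): string  { return $this->fallbackEncoding; }

    // The value actually in effect: the declared one, else the fallback.
    public function effectiveVersion(): string
    {
        return $this->declaredVersion ?? $this->fallbackVersion;
    }

    public function effectiveEncoding(): string
    {
        return $this->declaredEncoding ?? $this->fallbackEncoding;
    }
}

$d = new DocumentWithFallback('1.0', 'ISO-8859-1');
$d->load('<root/>');
var_dump($d->declaredEncoding());   // NULL
var_dump($d->effectiveEncoding());  // string(10) "ISO-8859-1"
```

With this split, reading a property never has to guess whether a value came from the user or from the document.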

The initial definitions are correct (with some limitations) for new documents created with DOMDocument. Even then they can (possibly) produce unwanted results if the user includes an XML declaration in the document.

But the problem is fully visible when an empty DOMDocument is created with a preset XML version and encoding and a document is then loaded into it.



As it seems to me, the root of the problem is the absence of answers to the questions "What is the DOMDocument? What does it describe? What does it correspond to?". The sooner clear and self-consistent answers to these questions are provided, the sooner expectations of DOMDocument and work with it will reach a much higher level of clarity and predictability.


Thanks again
 [2021-02-09 08:19 UTC] glash dot gnome at gmail dot com
Hello,

I can confirm that the merge request has been merged into the master branch of libxml2.


"does this need to handle the other discrepancies between the original and cloned nodes mentioned in the PHP bug report?"
-------------------------------------------------------------------
@Philip, yes, it does.



"What is a DOMDocument after a document has been loaded by loadXML?"
-------------------------------------------------------------------
@Andrey, as far as I know, the XML declaration (<?xml version="1.0"?>) is not part of the DOM document. The purpose of the declaration is to prepare the agent to read the document.
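
A small sketch of that point, using only standard DOMDocument members: the declaration does not show up as a node among the document's children, while its version value is exposed through the xmlVersion property.

```php
<?php
$doc = new DOMDocument();
$doc->loadXML('<?xml version="1.0"?><root><item/></root>');

// Collect the names of the document's direct child nodes.
$names = [];
foreach ($doc->childNodes as $node) {
    $names[] = $node->nodeName;
}

// Only the root element is listed; there is no node for the declaration.
echo implode(', ', $names), PHP_EOL;  // root

// The declared version is available as a document property instead.
echo $doc->xmlVersion, PHP_EOL;       // 1.0
```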


"What is the DOMDocument? What does it describe? What does it correspond to?"
-----------------------------------------------------------------------
The implementation of the DOM extension tells us that this is the internal tree structure of libxml. DOMDocument is therefore an XML representation.


For example, here is what DOMDocument is not:

// file "model-verbose.xml"
<?xml version="1.0"?>
<object type="Document">
  <property name="lang">fr</property>
  <property name="author">serge</property>
  <property name="update">09/02/21</property>
  <child>
    <object type="label">
      <property name="lang">en</property>
      <property name="text">Hello World !</property>
    </object>
  </child>
</object>

// file "model.xml"
<document lang="fr" author="serge" update="09/02/21">
  <label>Hello World !</label>
</document>

// file "model.json"
{
  "document": {
    "properties": {
      "lang": "fr",
      "author": "serge",
      "update": "09/02/21"
    },
    "children": [
      { "label": { "content": "Hello World !" } }
    ]
  }
}


$dom_v = My\Ext\Dom\Document::load("model-verbose.xml", $verbose);
$dom   = My\Ext\Dom\Document::load("model.xml");
$dom_j = My\Ext\Dom\Document::loadJson("model.json");

if($dom_v->root["lang"] == $dom->root["lang"]) {
  echo 'Same access, same Object Model' . PHP_EOL;
}
if($dom_v->root->children[0]["text"] == $dom->root->children[0]["text"]) {
  echo 'Same access, same Object Model' . PHP_EOL;
}

These are not the same XML (representation), but they are the same "mydocument" object model.


Kind regards,
Serge
 
Last updated: Sat Oct 23 18:03:33 2021 UTC