Bug #74988 DOMDocument::load() reports success but libxml_get_errors() returns errors
Submitted: 2017-07-25 21:09 UTC
Modified: 2017-07-27 01:24 UTC
From: paul at sparrowhawkcomputing dot com
Assigned:
Status: Not a bug
Package: DOM XML related
PHP Version: 5.6.31
OS: Windows 10 Pro
Private report: No
CVE-ID: None

 [2017-07-25 21:09 UTC] paul at sparrowhawkcomputing dot com
Description:
------------
Given the following XML document in test.xml:

    <?xml version="1.0"?>
    <root xml:space='foo'/>

The script in the "Test Script" field below reports that the instance is loaded successfully while simultaneously reporting well-formedness errors.

How can this instance be successfully loaded while there are well-formedness errors reported?



Test script:
---------------
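<?php
// Collect libxml errors internally instead of emitting PHP warnings.
libxml_use_internal_errors( true );
$dom = new DOMDocument();
// Start from an empty error buffer so only this load is reported.
libxml_clear_errors();
$success = $dom->load( __DIR__ . '/test.xml' );
$xml = $dom->saveXML();
$errs = libxml_get_errors();
var_dump( $success );
var_dump( $errs );
var_dump( $xml );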

Expected result:
----------------
Either:

    $success == false && ! empty( $errs ) && $xml === '<?xml version="1.0"?>'

or

    $success == true && empty( $errs ) && $xml === '<?xml version="1.0"?>
<root/>
'

That is, if libxml_get_errors() is going to return errors then DOMDocument::load() should return false.  If DOMDocument::load() is going to succeed, then @xml:space should be ignored and libxml_clear_errors() should be called internally before DOMDocument::load() returns.

Either alternative conforms to the XML spec, which says [1]:

    This specification does not give meaning to any value of xml:space other 
    than "default" and "preserve". It is an error for other values to be 
    specified; the XML processor may report the error or may recover by ignoring 
    the attribute specification or by reporting the (erroneous) value to the 
    application. Applications may ignore or reject erroneous values.

The status quo does not conform to the XML spec because it both reports the error and fails to ignore the @xml:space attribute.

I VERY MUCH prefer the first alternative, as it is consistent with XMLReader which correctly reports the well-formedness error and refuses to parse test.xml.

[1] https://www.w3.org/TR/REC-xml/#sec-white-space

Actual result:
--------------
bool(true)
array(1) {
  [0]=>
  object(LibXMLError)#260 (6) {
    ["level"]=>
    int(1)
    ["code"]=>
    int(102)
    ["column"]=>
    int(16)
    ["message"]=>
    string(69) "Invalid value "foo" for xml:space : "default" or "preserve" expected
"
    ["file"]=>
    string(87) "file://test.xml"
    ["line"]=>
    int(2)
  }
}
string(80) "<?xml version="1.0"?>
<root xml:space="foo"/>
"

 [2017-07-25 21:49 UTC] requinix@php.net
-Status: Open +Status: Not a bug
 [2017-07-25 21:49 UTC] requinix@php.net
If there is a bug here then it is not with PHP. It is with libxml. You'd have to report the problem there.

But I don't see the problem. As you quoted, the processor may recover by reporting the erroneous value to the application - which is exactly what happened. Here's some formatting:

"the XML processor
- may report the error or
- may recover by
  * ignoring the attribute specification or by
  * reporting the (erroneous) value to the application"

Seems like you're interpreting it as "the XML processor may report the error or... by reporting the (erroneous) value to the application" but that doesn't make sense.


> and libxml_clear_errors() should be called internally before DOMDocument::load() returns
That would discard *all* errors during loading. A very bad idea.
 [2017-07-26 18:06 UTC] paul at sparrowhawkcomputing dot com
requinix@php.net:

I've looked into this a little more and realized that LibXMLError has a $level property, and that libxml considers an error to be a well-formedness error only when $level === LIBXML_ERR_FATAL.  In the case of @xml:space it considers it only a LIBXML_ERR_WARNING.  Until now, I thought that a non-empty array from libxml_get_errors() meant that libxml had found a well-formedness error (my bad).

So, can you confirm for me that if DOMDocument::load() and DOMDocument::loadXML() return true that libxml_get_errors() is guaranteed to not contain any errors with $level === LIBXML_ERR_FATAL?

If so, then you can close this as "not a bug".
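
For readers following along, the distinction by $level can be checked with a short script. This is only a sketch: errors_at_level() is a hypothetical helper, not part of PHP or libxml, and the document is passed inline through loadXML() rather than read from test.xml. It avoids scalar type hints so that it should also run on PHP 5.6.

```php
<?php
// Hypothetical helper (not a PHP built-in): keep only the LibXMLError
// entries whose level matches the requested LIBXML_ERR_* constant.
function errors_at_level(array $errors, $level)
{
    return array_values(array_filter(
        $errors,
        function (LibXMLError $e) use ($level) {
            return $e->level === $level;
        }
    ));
}

// Buffer libxml errors instead of raising PHP warnings, and remember
// the previous setting so it can be restored afterwards.
$previous = libxml_use_internal_errors(true);
libxml_clear_errors();

$dom = new DOMDocument();
$loaded = $dom->loadXML("<?xml version=\"1.0\"?>\n<root xml:space='foo'/>");
$errors = libxml_get_errors();

$warnings = errors_at_level($errors, LIBXML_ERR_WARNING);
$fatals = errors_at_level($errors, LIBXML_ERR_FATAL);

var_dump($loaded);
printf("warnings: %d, fatal: %d\n", count($warnings), count($fatals));

libxml_clear_errors();
libxml_use_internal_errors($previous);
```

On the setup described in this report the load succeeds and the xml:space complaint shows up as a warning rather than a fatal error. The exact counts may differ between libxml versions.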
 [2017-07-26 18:40 UTC] requinix@php.net
I can confirm [1] that if libxml says !xmlParserCtxt.wellFormed [2] and DOMDocument::$recover=false (the default) then PHP will return false.

I cannot confirm whether all LIBXML_ERR_FATAL are considered well-formed-ness errors, or the exact conditions for when wellFormed=0, but I think the answer to your question is basically still "yes".

[1] https://github.com/php/php-src/blob/PHP-5.6.31/ext/dom/document.c#L1590
[2] http://xmlsoft.org/html/libxml-tree.html#xmlParserCtxt
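
The other half of that statement can be illustrated with a document that is clearly not well-formed. The sketch below assumes the default DOMDocument::$recover = false; the mismatched-tag input is an illustrative example, not taken from the report.

```php
<?php
// Buffer libxml errors so the failed parse does not raise PHP warnings.
libxml_use_internal_errors(true);
libxml_clear_errors();

$dom = new DOMDocument();   // $dom->recover is false by default
$result = $dom->loadXML('<root><child></root>');   // mismatched tags

// Record the level of every error libxml reported for this parse.
$levels = array();
foreach (libxml_get_errors() as $e) {
    $levels[] = $e->level;
}
libxml_clear_errors();

var_dump($result);
var_dump(in_array(LIBXML_ERR_FATAL, $levels, true));
```

Here loadXML() should return false and at least one reported error should carry LIBXML_ERR_FATAL, which matches the behavior described above for input that libxml flags as not well-formed.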
 [2017-07-26 20:44 UTC] paul at sparrowhawkcomputing dot com
Wow!  I had no idea the PHP sources were up on github.  Thanx.  That's very helpful in figuring out whether something is a PHP bug or a libxml bug.  Thanx for the pointer.

I just found another case that muddies the waters with regard to what libxml considers a well-formedness error.

libxml_use_internal_errors( true );
libxml_clear_errors();
$dom = new DOMDocument();
$dom->loadXML( '<root xmlns:xml="urn:foo"/>' );
$errs = libxml_get_errors();
var_dump( $errs );

produces:

array(1) {
  [0]=>
  object(LibXMLError)#260 (6) {
    ["level"]=>
    int(2)
    ["code"]=>
    int(200)
    ["column"]=>
    int(16)
    ["message"]=>
    string(41) "xml namespace prefix mapped to wrong URI
"
    ["file"]=>
    string(0) ""
    ["line"]=>
    int(1)
  }
}

So, libxml doesn't consider that a "fatal" error even tho that instance is unambiguously not well-formed [1].

So, please leave this ticket open while I dig more into exactly what libxml does and does not consider to be a well-formedness error.

[1] https://www.w3.org/TR/REC-xml-names/#xmlReserved
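
Until that is sorted out, a caller who wants to reject documents like this one can apply a stricter policy in userland. The following is a minimal sketch under that assumption: load_xml_strict() is a hypothetical wrapper, not an existing PHP function, and treating everything at LIBXML_ERR_ERROR or above as a failure is a policy choice rather than anything libxml or PHP prescribes.

```php
<?php
// Hypothetical wrapper: returns a DOMDocument, or null when the parse
// failed or libxml recorded anything at LIBXML_ERR_ERROR or above.
function load_xml_strict($xml)
{
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $dom = new DOMDocument();
    $ok = $dom->loadXML($xml);

    // Warnings (level 1) are tolerated; errors (2) and fatals (3) are not.
    foreach (libxml_get_errors() as $e) {
        if ($e->level >= LIBXML_ERR_ERROR) {
            $ok = false;
            break;
        }
    }

    // Leave the global libxml error state as it was found.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $ok ? $dom : null;
}

var_dump(load_xml_strict('<root xmlns:xml="urn:foo"/>') === null);
var_dump(load_xml_strict('<root/>') instanceof DOMDocument);
```

With this policy the xmlns:xml="urn:foo" instance is rejected because libxml reports it at level 2, while the xml:space case from the original report would still be accepted because it is only a warning.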
 [2017-07-27 01:24 UTC] requinix@php.net
I've seen enough of PHP's DOM code now that I believe any bugs found along these lines will not be with PHP itself. Which is to say, if there is such a bug (or other sort of incorrect or undesired behavior) then it would be in libxml.

We don't manage libxml, obviously, and we don't track their bugs. Just ours. And since PHP appears to be behaving correctly so far I'm going to leave this as NAB. If you find something that looks like a problem in PHP then I'll look too, but otherwise this bug tracker isn't really the best place to conduct a deep dive into libxml or how PHP interacts with it.