Bug #68437 saveXML produces invalid xml
Submitted: 2014-11-17 18:50 UTC
Modified: 2015-08-18 20:53 UTC
Votes:2
Avg. Score:4.5 ± 0.5
Reproduced:2 of 2 (100.0%)
Same Version:2 (100.0%)
Same OS:2 (100.0%)
From: lphp-bug at thax dot hardliners dot org
Assigned:
Status: Open
Package: DOM XML related
PHP Version: master-Git-2014-11-17 (Git)
OS:
Private report: No
CVE-ID: None

 [2014-11-17 18:50 UTC] lphp-bug at thax dot hardliners dot org
Description:
------------
Bug #54214 has been closed as Bogus, but gets the facts wrong:

1. XML (even the 1998 working drafts) defines valid chars as:
 Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

This contradicts the claim in #54214 that Char ::= [#x1-#xD7FF] | ...
Thus 0x0b (or 0x06) is actually not in the valid range.

2. The libxml2 "problem" is not in saving low-ASCII characters but in loading them: loadXML(saveXML()) ALWAYS fails when, e.g., a DOMText containing a low-ASCII character is present:

  DOMDocument::loadXML(): xmlParseCharRef: invalid xmlChar value 11 in Entity

3. saveXML() does not always warn.
More specifically, it only warns when
  (*) $doc has no encoding AND
  (**) saveXML() is called without arguments.

4. When saveXML() warns, it also removes the offending character (good). When it does not warn, the character is left in the output.
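
The Char production quoted in item 1 can be checked mechanically. The following is a minimal sketch, not part of the DOM extension: the function name isValidXml10Text is hypothetical, and the input is assumed to be UTF-8.

```php
<?php
// Hypothetical helper: returns true only if every code point in $s matches
// the XML 1.0 Char production:
//   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
// Assumes $s is UTF-8; invalid UTF-8 makes preg_match() fail, which is
// reported here as "not valid".
function isValidXml10Text(string $s): bool
{
    $pattern = '/\A[\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]*\z/u';
    return preg_match($pattern, $s) === 1;
}

var_dump(isValidXml10Text("asdf"));      // bool(true)
var_dump(isValidXml10Text("\x0b asdf")); // bool(false): 0x0b is outside the production
```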

Background:
- saveXML() uses xmlDocDumpFormatMemory(), which uses the document encoding when one is given. Otherwise ASCII encoding (0x00 ... 0x7f only) is assumed, and xmlEscapeEntities (libxml2/tree/xmlsave.c, line 208) is used as the escape function.
This function warns and drops the offending character.
- saveXML($el) uses xmlNodeDump(), which passes encoding==NULL to xmlNodeDumpOutput, which converts NULL to "UTF-8". Then xmlSaveCtxtInit() will not set the escape function.

When the escape function is not set on the xmlSaveCtxt, xmlOutputBufferWriteEscape uses xmlEscapeContent (libxml2/tree/xmlIO.c, line 3536) instead. This function does not warn and writes 0x0b to the output unchanged, where loadXML finally chokes on it.
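
To see which of the code paths described above a given build takes, the four combinations of (*) and (**) can be tried in one go. This is only a sketch: the helper name roundTrips is hypothetical, and the results depend on the PHP and libxml2 versions in use, so no particular output is claimed here.

```php
<?php
// Collect libxml errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Hypothetical helper: builds a document containing 0x0b, saves it either as
// a whole document or as a single node, and reports whether the saved string
// can be loaded again.
function roundTrips(?string $encoding, bool $dumpWholeDocument): bool
{
    $doc = $encoding === null
        ? new DOMDocument('1.0')             // (*) no document encoding
        : new DOMDocument('1.0', $encoding);

    $el = $doc->createElement('root');
    $doc->appendChild($el);
    $el->appendChild($doc->createTextNode("\x0b asdf"));

    $xml = $dumpWholeDocument
        ? $doc->saveXML()                    // (**) no node argument
        : $doc->saveXML($el);

    if (!is_string($xml) || $xml === '') {
        libxml_clear_errors();
        return false;
    }

    $ok = (new DOMDocument())->loadXML($xml);
    libxml_clear_errors();
    return $ok === true;
}

foreach ([null, 'utf-8'] as $encoding) {
    foreach ([true, false] as $whole) {
        printf(
            "encoding=%s, %s: round trip %s\n",
            $encoding ?? '(none)',
            $whole ? 'saveXML()' : 'saveXML($el)',
            roundTrips($encoding, $whole) ? 'ok' : 'FAILED'
        );
    }
}
```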

libxml2 never promises to warn about low-ASCII characters, so PHP developers cannot depend on it.
On the other hand, it is infeasible to force every PHP programmer to pre-sanitize strings passed to DOMText or to post-check the saved XML string. PHP therefore has to provide the necessary means to avoid invalid XML: at the very least warn, or even better, throw as early as DOMText.
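
For illustration, this is roughly what the userland pre-sanitizing mentioned above would look like today. It is a sketch of a workaround, not a proposed fix: the function name stripInvalidXml10Chars is hypothetical, the input is assumed to be UTF-8, and silently dropping characters may not be acceptable for every application.

```php
<?php
// Hypothetical helper: removes every code point that is not allowed by the
// XML 1.0 Char production. Throws if $s is not valid UTF-8, because
// preg_replace() returns null in that case.
function stripInvalidXml10Chars(string $s): string
{
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';
    $clean = preg_replace($pattern, '', $s);
    if ($clean === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $clean;
}

$doc = new DOMDocument('1.0', 'utf-8');
$el = $doc->createElement('root');
$doc->appendChild($el);

// Sanitize before the string ever reaches the DOM.
$el->appendChild($doc->createTextNode(stripInvalidXml10Chars("\x0b asdf")));

$str = $doc->saveXML($el);
$reloaded = (new DOMDocument())->loadXML($str);
var_dump($reloaded); // bool(true): the saved string no longer contains 0x0b
```

Every call site has to remember to do this, which is the burden the report argues should not fall on each PHP programmer.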

  


Test script:
---------------
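// Document with an explicit encoding.
$doc=new DOMDocument('1.0','utf-8'); // or: $doc=new DOMDocument('1.0'); (*)

$el=$doc->createElement('root');
$doc->appendChild($el);

// 0x0b is not a valid XML 1.0 character.
$el->appendChild(new DOMText("\x0b asdf"));

// Dump only the root element.
$str=$doc->saveXML($el); // or: $doc->saveXML()  (**)

var_dump($str);

// Try to load what was just saved.
$doc->loadXML($str);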


Expected result:
----------------
loadXML() should load the saved XML, or an exception should be thrown earlier.

Actual result:
--------------
loadXML() only works when both (*) and (**) are used, i.e. when saveXML() outputs a warning.

 [2015-08-18 20:53 UTC] cmb@php.net
> XML (even the 1998 working drafts) defines valid chars as:
> Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

That was XML 1.0[1]. However, XML 1.1[2] defines:

Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

[1] <http://www.w3.org/TR/REC-xml/#charsets>
[2] <http://www.w3.org/TR/xml11/#charsets>
 