Bug #68437 saveXML produces invalid xml
Submitted: 2014-11-17 18:50 UTC Modified: 2015-08-18 20:53 UTC
Votes:2
Avg. Score:4.5 ± 0.5
Reproduced:2 of 2 (100.0%)
Same Version:2 (100.0%)
Same OS:2 (100.0%)
From: lphp-bug at thax dot hardliners dot org Assigned:
Status: Open Package: DOM XML related
PHP Version: master-Git-2014-11-17 (Git) OS:
Private report: No CVE-ID: None
 [2014-11-17 18:50 UTC] lphp-bug at thax dot hardliners dot org
Description:
------------
Bug #54214 has been closed as Bogus, but gets the facts wrong:

1. XML (even the 1998 working drafts) defines valid chars as:
 Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

This contradicts the claim in #54214 that Char ::= [#x1-#xD7FF] | ...
Thus, 0x0b (or 0x06) is actually not in the valid range.

2. The libxml2 "problem" is not in saving low-ASCII characters but in loading them: loadXML(saveXML()) ALWAYS fails when, e.g., a DOMText containing a low-ASCII character is present:

  DOMDocument::loadXML(): xmlParseCharRef: invalid xmlChar value 11 in Entity

3. saveXML() does not always warn.
More specifically, it warns only when both of the following hold:
  (*) $doc has no encoding AND
  (**) saveXML() is called without arguments.

4. When saveXML() does warn, it removes the offending character (good). When it does not warn, the character is left in the output.
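
The XML 1.0 Char production quoted in point 1 can be checked in PHP with a PCRE character class. This is only a minimal sketch, assuming UTF-8 input; the function name isValidXml10Text is hypothetical and not part of the DOM extension:

```php
<?php
// Hypothetical helper: returns true if $s contains only characters allowed
// by the XML 1.0 Char production. Assumes $s is UTF-8 encoded.
function isValidXml10Text(string $s): bool
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    // preg_match() returns false for malformed UTF-8, which counts as invalid here.
    return preg_match(
        '/^[\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]*$/Du',
        $s
    ) === 1;
}

var_dump(isValidXml10Text("asdf"));      // bool(true)
var_dump(isValidXml10Text("\x0b asdf")); // bool(false)
```

With this check, the string from the test script is rejected because 0x0b lies outside every range in the production.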

Background:
- saveXML() uses xmlDocDumpFormatMemory(), which uses the document encoding when one is given. Otherwise ASCII encoding (0x00 ... 0x7f only) is assumed, and xmlEscapeEntities (libxml2/tree/xmlsave.c, line 208) is used as the escape function.
This function warns about the offending character and drops it.
- saveXML($el) uses xmlNodeDump(), which passes encoding==NULL to xmlNodeDumpOutput, which converts NULL to "UTF-8". Then xmlSaveCtxtInit() will not set the escape function.

When the escape function is not set on the xmlSaveCtxt, xmlOutputBufferWriteEscape will use xmlEscapeContent (libxml2/tree/xmlIO.c, line 3536) instead. This function does not warn, and happily includes 0x0b in the output (where loadXML finally chokes on it).

libxml2 never promises to warn about low-ASCII characters, so PHP developers can't depend on it.
OTOH, it's infeasible to force every PHP programmer to pre-sanitize strings passed to DOMText, or to post-check the saved XML string. Therefore PHP has to provide the necessary means to avoid invalid XML: at the very least warn, or even better, throw already in DOMText.
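
Until something like that exists, a userland guard can approximate "throw already in DOMText". The following is only a sketch of that idea: the function name createCheckedText is hypothetical, UTF-8 input is assumed, and the check follows the XML 1.0 Char production quoted above, not anything libxml2 itself enforces:

```php
<?php
// Hypothetical wrapper, not part of ext/dom: refuse text that XML 1.0
// cannot represent instead of letting saveXML() emit it silently.
function createCheckedText(DOMDocument $doc, string $text): DOMText
{
    // Matches the first character outside the XML 1.0 Char production.
    $invalid = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';

    $found = preg_match($invalid, $text, $m);
    if ($found === false) {
        // With /u, preg_match() fails on malformed UTF-8.
        throw new InvalidArgumentException('Text is not valid UTF-8');
    }
    if ($found === 1) {
        throw new InvalidArgumentException(
            'Byte sequence 0x' . bin2hex($m[0]) . ' is not allowed in XML 1.0'
        );
    }
    return $doc->createTextNode($text);
}

$doc = new DOMDocument('1.0', 'utf-8');
$root = $doc->appendChild($doc->createElement('root'));
$root->appendChild(createCheckedText($doc, 'asdf'));

try {
    $root->appendChild(createCheckedText($doc, "\x0b asdf"));
} catch (InvalidArgumentException $e) {
    echo $e->getMessage(), "\n"; // Byte sequence 0x0b is not allowed in XML 1.0
}
```

Rejecting the string at node creation keeps the failure next to the code that supplied the bad data, rather than surfacing later in loadXML().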

  


Test script:
---------------
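<?php
// (*) Alternative: create the document without an encoding: new DOMDocument('1.0')
$doc = new DOMDocument('1.0', 'utf-8');

$el = $doc->createElement('root');
$doc->appendChild($el);

// 0x0b (vertical tab) is not a valid XML 1.0 character.
$el->appendChild(new DOMText("\x0b asdf"));

// (**) Alternative: dump the whole document: $doc->saveXML()
$str = $doc->saveXML($el);

var_dump($str);

// Reloading the saved string fails unless both (*) and (**) are used.
$doc->loadXML($str);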


Expected result:
----------------
loadXML should load the saved XML, or an exception should be thrown earlier.

Actual result:
--------------
loadXML only works when (*) and (**) are used, i.e. when saveXML outputs a warning.

 [2015-08-18 20:53 UTC] cmb@php.net
> XML (even the 1998 working drafts) defines valid chars as:
> Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

That was XML 1.0[1]. However, XML 1.1[2] defines:

Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

[1] <http://www.w3.org/TR/REC-xml/#charsets>
[2] <http://www.w3.org/TR/xml11/#charsets>
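
The difference between the two productions can be made concrete with plain integer comparisons. This is a sketch only: the function name xmlCharSupport is hypothetical, and it compares the two quoted productions without saying anything about what libxml2 accepts when parsing:

```php
<?php
// Hypothetical helper: report whether a code point is a Char under the
// XML 1.0 and XML 1.1 productions quoted above.
function xmlCharSupport(int $cp): array
{
    // Ranges that both productions have in common.
    $common = ($cp >= 0x20 && $cp <= 0xD7FF)
        || ($cp >= 0xE000 && $cp <= 0xFFFD)
        || ($cp >= 0x10000 && $cp <= 0x10FFFF);

    return [
        // XML 1.0 adds only tab, line feed and carriage return below 0x20.
        '1.0' => $common || $cp === 0x9 || $cp === 0xA || $cp === 0xD,
        // XML 1.1 adds everything from 0x1 up to 0x1F.
        '1.1' => $common || ($cp >= 0x1 && $cp <= 0x1F),
    ];
}

var_dump(xmlCharSupport(0x0B)); // '1.0' => false, '1.1' => true
var_dump(xmlCharSupport(0x00)); // false under both productions
```

So 0x0b is a Char only under XML 1.1, while the test script in the report declares version '1.0'.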
 