Doc Bug #74036 createTextNode() docs unclear about binary data and metacharacters
Submitted: 2017-02-02 17:08 UTC
Modified: 2017-02-02 18:37 UTC
From: judge2005 at gmail dot com
Assigned:
Status: Verified
Package: DOM XML related
PHP Version: 7.0.15
OS: RHEL 7
Private report: No
CVE-ID: None

 

 [2017-02-02 17:08 UTC] judge2005 at gmail dot com
Description:
------------
If binary data is passed to createTextNode() it can cause invalid XML to be generated.

Test script:
---------------
<?php
$document  = new DOMDocument('1.0', 'UTF-8');
$document->formatOutput = true;

$root = $document->createElement('example');
$document->appendChild($root);

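// ETX, EOT and DC4 written as escape sequences so they survive rendering and copy/paste.
$example = $document->createTextNode("PK\x03\x04\x14");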
$root->appendChild($example);

echo $document->saveXML();

Actual result:
--------------
<?xml version="1.0" encoding="UTF-8"?>
<example>PK</example>


History

 [2017-02-02 17:10 UTC] judge2005 at gmail dot com
Looks like your bug report system strips out the binary. The additional characters were ETX (0x03), EOT (0x04) and DC4 (0x14).
 [2017-02-02 17:55 UTC] requinix@php.net
I don't think this is a bug. Though DOM 3 doesn't say much, from what I can gather createTextNode should not try to sanitize the text input. The characters & " ' < > are escaped during serialization, depending on context, but that's all; it's up to the developer not to do things like use control characters.

Testing with JavaScript in Chrome does the same thing: the characters are left as-is and not escaped. Adding the text to an XML document, serializing it to a string, then parsing the string results in a parse error.
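
The parse error can be reproduced on the PHP side as well. This is a minimal sketch that assumes a hand-written XML string standing in for the serialized output; it only shows that the parser rejects raw control characters inside element content, and the exact error message depends on the libxml2 version.

```php
<?php
// Hand-written XML containing the raw control characters ETX, EOT and DC4.
$xml = "<example>PK\x03\x04\x14</example>";

// Collect libxml errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$parsed = new DOMDocument();
$ok = $parsed->loadXML($xml);   // false: the parser rejects the control characters
$errors = libxml_get_errors();  // LibXMLError objects describing the failure
libxml_clear_errors();

var_dump($ok);
foreach ($errors as $error) {
    echo trim($error->message), "\n";
}
```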

And the bug system didn't strip them. The characters are there, your browser is just not rendering them as anything.
 [2017-02-02 18:10 UTC] judge2005 at gmail dot com
It is a tricky one for me. The input to the method is arbitrary as it is externally generated. If everyone who uses createTextNode() is forced to sanitize the input, it kills some of the benefit of createTextNode(), which is that it sanitizes most things, just not low ASCII values. Everyone would have to add code to perform the sanitization. In addition, in this case the call is in a third-party library (PHPUnit, to be precise). I have submitted a bug report to them too; however, there may be many third-party libraries that also use this method. At a minimum, the documentation should point out this problem and maybe point the reader to createCDATASection() (though I haven't tried that, so I don't know if it handles this). But given that it already performs sanitization on most of the input, it is arguable that it should handle this case too.
 [2017-02-02 18:37 UTC] requinix@php.net
-Summary: createTextNode() does not handle binary data
+Summary: createTextNode() docs unclear about binary data and metacharacters
-Status: Open
+Status: Verified
-Type: Bug
+Type: Documentation Problem
 [2017-02-02 18:37 UTC] requinix@php.net
The only thing a developer needs to ensure is that the text contains only characters/byte sequences that are valid for the target XML document. That means no \x00-\x1F (besides the whitespace characters tab, LF and CR) and that it uses the correct character encoding. I don't think that's an undue burden.
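
One way a caller could enforce this before calling createTextNode() is sketched below. It assumes a UTF-8 document and uses the XML 1.0 character ranges; the helper names isValidXmlText() and stripInvalidXmlChars() and the constant XML10_CHAR_CLASS are hypothetical and not part of ext/dom.

```php
<?php
// Hypothetical helpers; the names are illustrative and not part of ext/dom.

// Code points allowed by the XML 1.0 "Char" production.
const XML10_CHAR_CLASS = '\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}';

function isValidXmlText(string $text): bool
{
    // With /u, preg_match() returns false for invalid UTF-8, so that case is rejected too.
    return preg_match('/^[' . XML10_CHAR_CLASS . ']*$/Du', $text) === 1;
}

function stripInvalidXmlChars(string $text): string
{
    $clean = preg_replace('/[^' . XML10_CHAR_CLASS . ']/u', '', $text);
    if ($clean === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }
    return $clean;
}

$input = "PK\x03\x04\x14";
var_dump(isValidXmlText($input));        // bool(false)
var_dump(stripInvalidXmlChars($input));  // string(2) "PK"

$document = new DOMDocument('1.0', 'UTF-8');
$root = $document->createElement('example');
$document->appendChild($root);
$root->appendChild($document->createTextNode(stripInvalidXmlChars($input)));
echo $document->saveXML();
```

Stripping silently changes the data, so rejecting the input (or encoding it some other way before it reaches the DOM) may be the better choice when the bytes matter.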

Meanwhile, & ' " < > are handled appropriately when the text node is serialized, which is something explicitly mentioned in DOM 2/3.

However, createTextNode's documentation could use a little more explanation about what's going on, as apparently there's been confusion about whether/how text is escaped. For example, as I mentioned earlier, escaping does happen, and it does so at serialization rather than when the text node is created. https://3v4l.org/DQh17
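
A small sketch of that timing follows. It is written for illustration and is not necessarily what the linked 3v4l snippet contains: the text node keeps the string exactly as passed in, and the markup characters are only escaped in the output of saveXML().

```php
<?php
$document = new DOMDocument('1.0', 'UTF-8');
$root = $document->createElement('example');
$document->appendChild($root);

$text = $document->createTextNode('a < b & c');
$root->appendChild($text);

// The node stores the string exactly as given; nothing is escaped yet.
var_dump($text->data);   // string(9) "a < b & c"

// Escaping is applied only when the tree is serialized.
$serialized = $document->saveXML($root);
echo $serialized, "\n";  // <example>a &lt; b &amp; c</example>
```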

Though not handling <= \x1F is unfortunate, I think it would be best to match existing widespread implementations.
 
Last updated: Wed Aug 21 15:01:27 2019 UTC