Doc Bug #74036 createTextNode() docs unclear about binary data and metacharacters
Submitted: 2017-02-02 17:08 UTC
Modified: 2017-02-02 18:37 UTC
From: judge2005 at gmail dot com
Assigned:
Status: Verified
Package: DOM XML related
PHP Version: 7.0.15
OS: RHEL 7
Private report: No
CVE-ID: None
 [2017-02-02 17:08 UTC] judge2005 at gmail dot com
Description:
------------
If binary data is passed to createTextNode() it can cause invalid XML to be generated.

Test script:
---------------
<?php
$document  = new DOMDocument('1.0', 'UTF-8');
$document->formatOutput = true;

$root = $document->createElement('example');
$document->appendChild($root);

// "PK" followed by ETX, EOT and DC4, written as escapes so they stay visible
$example = $document->createTextNode("PK\x03\x04\x14");
$root->appendChild($example);

echo $document->saveXML();

Actual result:
--------------
<?xml version="1.0" encoding="UTF-8"?>
<example>PK</example>


 [2017-02-02 17:10 UTC] judge2005 at gmail dot com
Looks like your bug report system strips out the binary data. The additional characters were ETX (\x03), EOT (\x04) and DC4 (\x14).
 [2017-02-02 17:55 UTC] requinix@php.net
I don't think this is a bug. DOM 3 doesn't say much, but from what I can gather createTextNode should not try to sanitize the text input. The characters & " ' < > are escaped during serialization, according to context, but that's all; it's up to the developer not to do things like use control characters.

Testing with JavaScript in Chrome shows the same behavior: the characters are left as-is and not escaped. Adding such a text node to an XML document, serializing the document to a string, then parsing that string results in a parse error.
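
The same round trip can be sketched in PHP. This is a minimal sketch, assuming a libxml2 build that writes control characters as-is, as in this report; the payload "PK\x03\x04\x14" is reconstructed from the ETX, EOT and DC4 characters named above, and other libxml2 versions may serialize such characters differently.

```php
<?php
// Collect libxml parse errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$document = new DOMDocument('1.0', 'UTF-8');
$root = $document->createElement('example');
$document->appendChild($root);

// Assumed payload: "PK" plus ETX, EOT and DC4, as described in the report.
$root->appendChild($document->createTextNode("PK\x03\x04\x14"));

// Serialization succeeds; no exception or warning is raised here.
$xml = $document->saveXML();

// Parsing the serialized string is where the problem shows up: when the
// control characters were written as-is, loadXML() is expected to fail.
$reparsed = new DOMDocument();
$ok = $reparsed->loadXML($xml);
$errors = libxml_get_errors();
libxml_clear_errors();

var_dump($ok);
foreach ($errors as $error) {
    echo trim($error->message), "\n";
}
```

Calling libxml_use_internal_errors(true) first keeps the parse failure inspectable through libxml_get_errors() rather than surfacing as a warning.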

And the bug system didn't strip them. The characters are there, your browser is just not rendering them as anything.
 [2017-02-02 18:10 UTC] judge2005 at gmail dot com
It is a tricky one for me. The input to the method is arbitrary, as it is externally generated. If everyone who uses createTextNode() is forced to sanitize the input, that kills some of its benefit, which is that it already sanitizes most things, just not low ASCII values. Everyone would have to add code to perform the sanitization. In addition, in this case the call is in a third-party library (PHPUnit, to be precise). I have submitted a bug report to them too, but there may be many other third-party libraries that use this method. At a minimum, the documentation should point out this problem and maybe point the reader to createCDATASection() (though I haven't tried that, so I don't know whether it handles this). But given that it already performs sanitization on most of the input, it is arguable that it should handle this case too.
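
For callers who have to sanitize the input themselves, the extra code could look like the sketch below. It assumes UTF-8 input and uses a hypothetical helper, stripInvalidXmlChars(), that is not part of PHP or PHPUnit; it removes every character outside the XML 1.0 character range. Note that the XML 1.0 grammar applies the same character restrictions to CDATA sections, so createCDATASection() is unlikely to avoid the problem.

```php
<?php
/**
 * Hypothetical helper: drop every character that XML 1.0 does not allow.
 * Assumes $text is UTF-8; preg_replace() returns null for invalid UTF-8.
 */
function stripInvalidXmlChars(string $text): string
{
    $clean = preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $text
    );
    if ($clean === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $clean;
}

$document = new DOMDocument('1.0', 'UTF-8');
$root = $document->createElement('example');
$document->appendChild($root);

// Sanitize the externally generated text before it reaches the DOM.
$clean = stripInvalidXmlChars("PK\x03\x04\x14");
$root->appendChild($document->createTextNode($clean));

// The serialized document now contains only legal characters and parses again.
$reparsed = new DOMDocument();
$ok = $reparsed->loadXML($document->saveXML());
var_dump($clean, $ok);
```

Stripping is only one policy; replacing the offending characters with a placeholder, or rejecting the input outright, may suit other callers better.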
 [2017-02-02 18:37 UTC] requinix@php.net
-Summary: createTextNode() does not handle binary data
+Summary: createTextNode() docs unclear about binary data and metacharacters
-Status: Open
+Status: Verified
-Type: Bug
+Type: Documentation Problem
 [2017-02-02 18:37 UTC] requinix@php.net
The only thing a developer needs to ensure is that the text contains only characters/byte sequences that are valid for the target XML document. That means no \x00-\x1F (besides the whitespace characters tab, line feed and carriage return) and the correct character encoding. I don't think that's an undue burden.
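
A check along these lines might look like the following. It is only a sketch, assuming the target document is UTF-8 XML 1.0; isValidXmlText() is a hypothetical name, not an existing PHP function.

```php
<?php
/**
 * Hypothetical check: true when $text is valid UTF-8 and contains only
 * characters allowed by XML 1.0 (tab, LF, CR, and U+0020 upward, minus
 * surrogates, U+FFFE and U+FFFF).
 */
function isValidXmlText(string $text): bool
{
    // With the /u modifier preg_match() returns false for invalid UTF-8,
    // so comparing against 1 rejects both bad encoding and bad characters.
    return preg_match(
        '/\A[\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]*\z/u',
        $text
    ) === 1;
}

var_dump(isValidXmlText("line one\n\tline two")); // whitespace controls are allowed
var_dump(isValidXmlText("PK\x03\x04\x14"));       // ETX, EOT and DC4 are not
var_dump(isValidXmlText("\xFF"));                 // not valid UTF-8
```

A caller can run this before createTextNode() and decide whether to reject, strip or replace the offending input.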

Meanwhile & ' " < > are handled appropriately when the text node is serialized, which is explicitly mentioned in DOM 2/3.

However, createTextNode's documentation could use a little more explanation of what's going on, as there has apparently been confusion about whether and how text is escaped. For example, as I mentioned earlier, escaping does happen, but at serialization rather than when the text node is created. https://3v4l.org/DQh17
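
The difference between creation and serialization can be seen in a short sketch. This is not the linked 3v4l snippet; the sample string 'a < b & c' is an assumption chosen for illustration.

```php
<?php
$document = new DOMDocument('1.0', 'UTF-8');
$root = $document->createElement('example');
$document->appendChild($root);

$text = $document->createTextNode('a < b & c');
$root->appendChild($text);

// The node stores the string exactly as given; nothing is escaped yet.
$stored = $text->data;

// Escaping is applied only when the node is written out as markup.
$serialized = $document->saveXML($root);

var_dump($stored);     // string(9) "a < b & c"
var_dump($serialized); // string(35) "<example>a &lt; b &amp; c</example>"
```

Passing a node to saveXML() serializes just that node, which keeps the XML declaration out of the output.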

Though not handling <= \x1F is unfortunate, I think it would be best to match existing widespread implementations.