Bug #76672 Vertical Tab char in the value of DOMElement::setAttribute produces invalid xml
Submitted: 2018-07-27 07:30 UTC
Modified: 2018-07-27 10:29 UTC
From: mioshchikhes at jobrouter dot de
Assigned:
Status: Not a bug
Package: XML related
PHP Version: 7.1.20
OS: Windows
Private report: No
CVE-ID: None
 [2018-07-27 07:30 UTC] mioshchikhes at jobrouter dot de
Description:
------------
A vertical tab character (\v) in the value passed to DOMElement::setAttribute produces invalid XML.
The XML can be saved, but it contains an invalid character and cannot be read back. The behavior is reproducible.

Test script:
---------------
<?php
$dom = new DOMDocument();
$node = $dom->createElement('my-test');
$node->setAttribute('my-attribute', "vertical\vtabs");
$dom->appendChild($node);

$text = $dom->saveXML();
var_dump($text);

$dom = new DOMDocument();
$dom->loadXML($text);
var_dump($dom->saveXML());

Expected result:
----------------
Either an error or valid XML

Actual result:
--------------
Invalid XML is produced.

 [2018-07-27 08:42 UTC] requinix@php.net
-Status: Open
+Status: Analyzed
-Package: DOM XML related
+Package: XML related
 [2018-07-27 08:42 UTC] requinix@php.net
Vertical tabs aren't allowed in XML 1.0 and thus need to be escaped as &#11; or &#xB;. They are allowed in XML 1.1 but are discouraged as one of the "compatibility characters". Now I'm not sure exactly which rules libxml follows, but at least in this aspect it seems to follow 1.0 when reading. And unfortunately libxml won't automatically escape them when writing.

PHP could, but the question is whether it should. At least whether it should *now*. On one hand, this changes the behavior of code people have been relying on for years, but on the other hand (a) the generated XML may not have been valid to begin with and (b) any parser that accepts invalid unescaped characters should accept the escaped versions transparently - even if humans don't realize they're the same.

The normal answer to this problem is "you have to escape it yourself" (especially when it comes to the addChild/createTextNode problem) but if there's an opportunity to reduce the number of times that's necessary then I'd like to see if we can.

I'm only moving this to Analyzed so someone more familiar with the XML side of PHP can decide what to do.
a) Not a bug, libxml follows XML 1.0 and you have to escape it yourself
b) Not a bug, libxml follows XML 1.1 and it is the one that can't handle \v not PHP
c) Is a bug, PHP should take some measures to escape strings automatically
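For anyone who lands on option (a), here is a minimal sketch of handling it yourself. Note that it replaces the character rather than escaping it: setAttribute treats its value as literal text, so a hand-written "&#11;" would be serialized as "&amp;#11;". The helper name xml10_clean and the default replacement (a space) are assumptions made for illustration, not anything PHP or libxml provides. The character class mirrors the Char production of XML 1.0.

```php
<?php
// Hypothetical helper, not part of PHP or libxml.
function xml10_clean(string $value, string $replacement = ' '): string
{
    // Anything outside the XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';

    $clean = preg_replace($pattern, $replacement, $value);
    if ($clean === null) {
        // With the /u modifier, preg_replace() fails on malformed UTF-8.
        throw new InvalidArgumentException('preg_replace failed (is the value valid UTF-8?)');
    }
    return $clean;
}

$dom = new DOMDocument();
$node = $dom->createElement('my-test');
$node->setAttribute('my-attribute', xml10_clean("vertical\vtabs"));
$dom->appendChild($node);

$text = $dom->saveXML();

// The serialized document no longer contains a raw \v, so it can be reloaded.
$check = new DOMDocument();
var_dump($check->loadXML($text));
```

Whether a space, an empty string, or some visible placeholder is the right replacement depends on what the attribute is used for; the sketch only shows where the filtering would sit.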
 [2018-07-27 09:23 UTC] mioshchikhes at jobrouter dot de
In my opinion, it is a bug in libxml.
The characters \t, \n and \r set through DOMElement::setAttribute are correctly converted to &#9;, &#10; and &#13;, but \v is not.
 [2018-07-27 09:36 UTC] requinix@php.net
That's necessary because a raw \n, \r or \t in an attribute value gets normalized to a space. To keep the character when reading, it must have been escaped when writing.
http://www.w3.org/TR/2006/REC-xml11-20060816#AVNormalize 3.3.3 Attribute-Value Normalization
They didn't include \v in there.
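A small sketch of that normalization, assuming the DOM extension is available. The element and attribute names are made up for the example; attribute a carries a literal newline, attribute b carries the same character as a character reference.

```php
<?php
$dom = new DOMDocument();
// "a" contains a raw newline, "b" contains the escaped form &#10;.
$dom->loadXML("<r a=\"x\ny\" b=\"x&#10;y\"/>");

$a = $dom->documentElement->getAttribute('a');
$b = $dom->documentElement->getAttribute('b');

var_dump($a); // the raw newline was normalized to a space
var_dump($b); // the character reference survives as a real newline
```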
 [2018-07-27 09:48 UTC] mioshchikhes at jobrouter dot de
In this case I would expect the method to throw an exception if the input is invalid, rather than create invalid XML.
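Until something like that exists, the check can be approximated in userland. The following is only a sketch: the wrapper name set_attribute_strict and the use of InvalidArgumentException are assumptions, and DOMElement::setAttribute itself does not behave this way.

```php
<?php
// Hypothetical wrapper; the name and the choice of exception are assumptions.
function set_attribute_strict(DOMElement $element, string $name, string $value): void
{
    // Matches any character outside the XML 1.0 Char production.
    $invalid = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';

    // preg_match() returns 1 on a match and false on malformed UTF-8;
    // in both cases the value cannot be written as-is.
    if (preg_match($invalid, $value) !== 0) {
        throw new InvalidArgumentException(
            "Attribute '$name' contains a character that XML 1.0 does not allow"
        );
    }
    $element->setAttribute($name, $value);
}

$dom = new DOMDocument();
$node = $dom->appendChild($dom->createElement('my-test'));

try {
    set_attribute_strict($node, 'my-attribute', "vertical\vtabs");
} catch (InvalidArgumentException $e) {
    echo 'Rejected: ', $e->getMessage(), PHP_EOL;
}
```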
 [2018-07-27 10:29 UTC] requinix@php.net
-Status: Analyzed
+Status: Not a bug
 [2018-07-27 10:29 UTC] requinix@php.net
I agree, it should never produce invalid XML. Perhaps unexpected markup sometimes (try putting "-->" into a comment) but at least it's valid.

After experimenting more with what libxml encodes and when, I'm satisfied that it's generally consistent: < > & are encoded as needed and normal characters are turned into entities when the document charset doesn't support them.

Control characters other than \n, \r and \t are the problem. They should be converted into entities, as happens with any other unsupported character. And that's a libxml problem.
 [2018-07-27 22:47 UTC] a at b dot c dot de
Interesting you should mention that about "-->", because that leads to another libxml bug (if they consider it a bug):

<?php
$doc = new DOMDocument();
$node = $doc->appendChild(new DOMElement('my-test'));
$comment = $node->appendChild(new DOMComment("This comment has a double hyphen -- something that isn't allowed in XML."));

echo $doc->saveXML();
$doc->save('c:/tmp/test.xml');

?>
Produces invalid output.
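A possible userland guard, sketched under the assumption that altering the comment text is acceptable. The helper name safe_comment_text is hypothetical. It relies on the XML rule that a comment may not contain "--" and may not end with "-".

```php
<?php
// Hypothetical helper, not part of PHP or libxml.
function safe_comment_text(string $text): string
{
    // Put a space after every hyphen that is directly followed by another one,
    // so "--" becomes "- -" and "---" becomes "- - -".
    $text = preg_replace('/-(?=-)/', '- ', $text);

    // A trailing hyphen would run into the closing "-->".
    if (substr($text, -1) === '-') {
        $text .= ' ';
    }
    return $text;
}

$doc = new DOMDocument();
$node = $doc->appendChild(new DOMElement('my-test'));
$node->appendChild(new DOMComment(
    safe_comment_text("This comment has a double hyphen -- something that isn't allowed in XML.")
));

$xml = $doc->saveXML();

// With the double hyphen broken up, the output should load again.
$check = new DOMDocument();
var_dump($check->loadXML($xml));
```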
 