[2006-06-21 20:11 UTC] brandenrauch at gmail dot com
Description:
------------
For my project, my data passes through both XML and XSL. I've chosen to use decimal (ASCII) entities, e.g. &#34;, for input such as quotes ("), single quotes ('), less-thans (<), greater-thans (>), and ampersands (&).
However, when I load my XML into DOM, it automatically transforms these characters into either their natural ASCII form (specifically quotes) or an HTML entity. These transformations are made regardless of the substituteEntities boolean setting on the DOMDocument object.
Reproduce code:
---------------
$text = '<xml><text>&#60;tag&#62;</text><text>&#34;quotes&#34;</text></xml>';
$dom = new DOMDocument();
$dom->substituteEntities = false;
$dom->loadXML($text);
echo $dom->saveHTML();
Expected result:
----------------
<xml><text>&#60;tag&#62;</text><text>&#34;quotes&#34;</text></xml>
Actual result:
--------------
<xml><text>&lt;tag&gt;</text><text>"quotes"</text></xml>
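A minimal sketch of why the setting appears to make no difference, assuming the report's input with the decimal references written out explicitly. Numeric character references such as &#34; are resolved by the XML parser itself, so by the time the document is in the DOM the text nodes hold plain characters and the serializer has no record of the original spelling. The variable names `$first` and `$second` are illustrative only.

```php
<?php
// Sketch: inspect what the DOM holds after parsing the report's input.
$dom = new DOMDocument();
$dom->substituteEntities = false;
$dom->loadXML('<xml><text>&#60;tag&#62;</text><text>&#34;quotes&#34;</text></xml>');

$texts  = $dom->getElementsByTagName('text');
$first  = $texts->item(0)->textContent;
$second = $texts->item(1)->textContent;

// Both nodes already contain the literal characters, not the references.
var_dump($first === '<tag>');
var_dump($second === '"quotes"');

// On output, only the characters that must be escaped in text content are escaped.
$serialized = $dom->saveXML($dom->documentElement);
echo $serialized, "\n";
```

Seen this way, the "actual result" is the serializer re-escaping literal characters, not the loader rewriting entities.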
I'm seeing the same behavior in attributes: I get a difference between input (load) and output (save).

input: svg:font-family="&apos;Courier New&apos;"
output: svg:font-family="'Courier New'"

Because of this I'm unable to write out an XML document without any modifications. This is important if you just want to change a small part of the XML and not change the escaping everywhere. The following code shows this behavior for attributes:

######################################################
<?php
$xml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<office:document-content xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" xmlns:svg="urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0" xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0" office:version="1.2">
<style:font-face style:name="Courier New" svg:font-family="&apos;Courier New&apos;" style:font-family-generic="modern" style:font-pitch="fixed"/>
</office:document-content>
XML;

$doc = new \DOMDocument();
// Set before loading: the flag is read when the document is parsed.
$doc->substituteEntities = false;
$doc->loadXML($xml);
printf("%s\n", $doc->saveXML());
######################################################

It may well be that libxml's substituteEntities does not change anything in this situation. However, I did find a few points that indicate that it doesn't _have_ to be this way.

1. LibreOffice 5.4.3.2 saves ODT files with XML that looks similar to the input of the test script. Not that LibreOffice defines the XML standard, but it's a pretty big player, so it counts for something.
2. When I read the standard (now located at https://www.w3.org/TR/xml/#sec-predefined-ent), the first line reads: "Entity and character references may both be used to escape the left angle bracket, ampersand, and other delimiters." So I understand that entities like &apos; may be used. In other words: this seems to be libxml related, and the input is not against the standard.
3. This interesting SO answer, https://stackoverflow.com/a/10064066, suggests that entity substitution can be controlled with the character encoding.

I couldn't find the encoding "HTML, a specific handler for the conversion of UTF-8 to ASCII with HTML predefined entities like &copy; for the Copyright sign." mentioned at http://xmlsoft.org/encoding.html. I suspect this particular encoding is not exposed by PHP.

I'm also not sure whether it's possible to create a user-land solution that traverses the DOM and sets the right characters for attributes and values. My guess is that it's not possible as long as you still want to use the DOMDocument::save* functions.

Maybe this additional information will help someone else along. I'm still stuck myself, and I will use string replace (or regex) on the output XML to change it back to the way LibreOffice outputs it.
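For anyone taking the same string-replace route, here is a rough sketch of what such post-processing could look like. The function name `reescapeAttributeApostrophes` is hypothetical and not part of DOM or libxml. It assumes the input was produced by DOMDocument::saveXML(), i.e. attribute values are double-quoted and any `<` or `>` inside them is already escaped. It does not understand comments or CDATA sections, so tag-like text inside those would be rewritten as well.

```php
<?php
// Hypothetical helper (not a DOM or libxml API): re-escape apostrophes inside
// double-quoted attribute values of an already serialized XML string.
function reescapeAttributeApostrophes(string $xml): string
{
    // Outer pass: match start tags and empty-element tags only. The second
    // character excludes "</" (end tags), "<?" (declarations, PIs) and "<!".
    return preg_replace_callback(
        '/<[^<>!?\/][^<>]*>/',
        function (array $tag): string {
            // Inner pass: rewrite each double-quoted attribute value in the tag.
            return preg_replace_callback(
                '/="([^"]*)"/',
                function (array $attr): string {
                    return '="' . str_replace("'", '&apos;', $attr[1]) . '"';
                },
                $tag[0]
            );
        },
        $xml
    );
}

// Example on a string shaped like saveXML() output; text content is left alone.
$serialized = '<font-face name="Courier New" family="\'Courier New\'">it\'s</font-face>';
$fixed = reescapeAttributeApostrophes($serialized);
echo $fixed, "\n";
```

In the test script above it would be applied as `reescapeAttributeApostrophes($doc->saveXML())`. This only restores apostrophes in attributes; other escaping differences would need their own handling.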