Bug #37878 Dom Automatically Replaces ASCII Entities Regardless of substituteEntities
Submitted: 2006-06-21 20:11 UTC Modified: 2006-06-23 02:38 UTC
From: brandenrauch at gmail dot com Assigned: rrichards (profile)
Status: Not a bug Package: DOM XML related
PHP Version: 5.1.4 OS: XP
Private report: No CVE-ID: None
 [2006-06-21 20:11 UTC] brandenrauch at gmail dot com
Description:
------------
For my project, my data passes through both XML and XSL. I've chosen to use decimal (ASCII) character references (for example, &#34;) for input such as double quotes ("), single quotes ('), less-than signs (<), greater-than signs (>), and ampersands (&).

However, when I load my XML into DOM, it automatically transforms these references into either their literal ASCII form (specifically quotes) or a named entity such as &lt;. These transformations are made regardless of the substituteEntities boolean setting on the DOMDocument object.

Reproduce code:
---------------
$text = '<xml><text>&#60;tag&#62;</text><text>&#34;quotes&#34;</text></xml>';

$dom = new DOMDocument();
$dom->substituteEntities = false;

$dom->loadXML($text);

echo $dom->saveHTML();

Expected result:
----------------
<xml><text>&#60;tag&#62;</text><text>&#34;quotes&#34;</text></xml>

Actual result:
--------------
<xml><text>&lt;tag&gt;</text><text>"quotes"</text></xml>

 [2006-06-22 19:32 UTC] tony2001@php.net
Assigned to the maintainer.
 [2006-06-23 02:38 UTC] rrichards@php.net
Thank you for taking the time to write to us, but this is not
a bug. Please double-check the documentation available at
http://www.php.net/manual/ and the instructions on how to report
a bug at http://bugs.php.net/how-to-report.php

Behavior is correct. These are predefined entities, and substituteEntities has no effect on how they are handled.
See the specs for more info: http://www.w3.org/TR/2004/REC-xml-20040204/#sec-predefined-ent
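
To illustrate the point above, here is a minimal sketch, assuming a stock PHP build with the DOM extension. It suggests that the parser resolves the character references while loading, so the tree only holds literal characters, and the serializer then escapes only what it has to. The variable names are illustrative and not part of the original report.

```php
<?php
// Character references such as &#60; are resolved by the parser while loading,
// so the DOM tree only holds the literal characters.
$dom = new DOMDocument();
$dom->substituteEntities = false;
$dom->loadXML('<xml><text>&#60;tag&#62;</text><text>&#34;quotes&#34;</text></xml>');

$texts  = $dom->getElementsByTagName('text');
$first  = $texts->item(0)->textContent;  // literal characters: <tag>
$second = $texts->item(1)->textContent;  // literal characters: "quotes"

// On output the serializer picks the escaping: < and > must be escaped in
// text content, while double quotes are legal there and are written as-is.
$serialized = $dom->saveXML($dom->documentElement);
echo $serialized, "\n";
```

The original spelling of each character (decimal reference, named entity, or literal) is not stored in the tree, which is why it cannot be reproduced on save.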
 [2018-01-02 13:31 UTC] flip101 at gmail dot com
I'm seeing the same behavior in attributes: I get a difference between input (load) and output (save).

input: svg:font-family="&apos;Courier New&apos;"
output: svg:font-family="'Courier New'"

Because of this, I'm unable to write out XML without modifications. This matters if you just want to change a small part of the XML and not change the escaping everywhere.

The following code shows this behavior for attributes:
######################################################
<?php

$xml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<office:document-content xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" xmlns:svg="urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0" xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0" office:version="1.2">

  <style:font-face style:name="Courier New" svg:font-family="&apos;Courier New&apos;" style:font-family-generic="modern" style:font-pitch="fixed"/>

</office:document-content>
XML;

$doc = new \DOMDocument();
// substituteEntities is applied while parsing, so set it before loadXML().
$doc->substituteEntities = false;
$doc->loadXML($xml);
printf("%s\n", $doc->saveXML());
######################################################

It may well be that substituteEntities does not change this situation in libxml. However, I did find a few points that indicate it doesn't _have_ to be this way.

1. LibreOffice 5.4.3.2 saves ODT files with XML that looks similar to the input of the test script. LibreOffice doesn't define the XML standard, but it's a pretty big player, so it counts for something.
2. When I read the standard (now located at https://www.w3.org/TR/xml/#sec-predefined-ent ), the first line reads: "Entity and character references may both be used to escape the left angle bracket, ampersand, and other delimiters." So I understand that entities like &apos; may be used. In other words, this seems to be a libxml choice rather than something the standard requires.
3. This interesting SO answer, https://stackoverflow.com/a/10064066 , suggests that entity substitution can be controlled through the character encoding.

I couldn't find the encoding "HTML, a specific handler for the conversion of UTF-8 to ASCII with HTML predefined entities like &copy; for the Copyright sign." mentioned at http://xmlsoft.org/encoding.html . I suspect this particular encoding is not exposed by PHP. I'm also not sure whether it's possible to create a userland solution that traverses the DOM and sets the right characters for attributes and values. My guess is that it's not possible as long as you still want to use the DOMDocument::save* functions.

Maybe this additional information will help someone else along. I'm still stuck myself, so I will use string replacement (or a regex) on the output XML to change it back to the way LibreOffice writes it.
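
The string-replacement workaround mentioned above could look roughly like the following. This is only a sketch under stated assumptions: the sample document and variable names are hypothetical, it assumes the serializer writes attribute values in double quotes (as in the output shown earlier), and it does not handle comments, CDATA sections, or processing instructions that happen to contain tag-like text.

```php
<?php
// Hypothetical sample: &apos; in an attribute, a literal apostrophe in text.
$xml = '<root><item name="&apos;Courier New&apos;">it\'s text</item></root>';

$doc = new DOMDocument();
$doc->loadXML($xml);
$out = $doc->saveXML($doc->documentElement);

// Walk over each tag in the serialized output and re-escape apostrophes
// inside double-quoted attribute values only; text content is left alone.
$restored = preg_replace_callback(
    '/<[^<>]+>/',
    function (array $tag): string {
        return preg_replace_callback(
            '/"[^"]*"/',
            function (array $attr): string {
                return str_replace("'", '&apos;', $attr[0]);
            },
            $tag[0]
        );
    },
    $out
);

echo $restored, "\n";
```

Note that this re-escapes every apostrophe in every attribute value, including ones that were literal in the input, so it restores LibreOffice-style escaping rather than the exact original bytes.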
 