Bug #15092 xml_parse() fails if XML-data contains entity like &nbsp; or &copy; ...
Submitted: 2002-01-17 21:03 UTC Modified: 2002-01-22 19:02 UTC
From: bs_php at infeer dot com Assigned:
Status: Closed Package: XML related
PHP Version: 4.1.0 OS: Win 2k (all, I guess)
Private report: No CVE-ID: None
 [2002-01-17 21:03 UTC] bs_php at infeer dot com
PHP's XML parser has problems with the full ISO 8859-1 character set when entity names are used. E.g. the parser fails with "undefined entity" if the XML data you parse contains &nbsp; or &copy; and so on (there are many more).

Some entities do work, such as &lt;, &gt; and &amp;, and so does the alternative notation using the ISO code number, e.g. for the non-breaking space: &nbsp; === &#160;

For a full ISO 8859-1 list and its entities see: http://www.ramsch.org/martin/uni/fmi-hp/iso8859-1.html

Here's the test script you can use to check the error :
<?php
$xmlString[0] = "<AAA>&#160;</AAA>";
$xmlString[1] = "<AAA>&nbsp;</AAA>";

  function startElement($xml_parser, $name, $attrs) {}
  function endElement($xml_parser, $name) {}
  function characterData($xml_parser, $text) {echo "Handling character data: '".htmlspecialchars($text)."'<br>";}
  
  $xml_parser = xml_parser_create();
  xml_set_element_handler($xml_parser, "startElement", "endElement");
  xml_set_character_data_handler($xml_parser,  "characterData");
  
  // Parse the XML data.
  if (!xml_parse($xml_parser, $xmlString[1], TRUE)) {
   // Note: the original message interpolated an undefined variable {$source}; it is dropped here.
   echo "XML error on line ". xml_get_current_line_number($xml_parser) . 
        '  column ' . xml_get_current_column_number($xml_parser) .
        '. Reason:' . xml_error_string(xml_get_error_code($xml_parser));
  }
?>


 [2002-01-22 09:49 UTC] bs_php at infeer dot com
After some more tests I found that the only literal entities that work are: &amp;, &lt;, &gt; and &quot;. 
*ALL* others (like &nbsp;, &copy; and so on) cause an XML_ERROR_UNDEFINED_ENTITY error.

The best workaround for this problem is to translate the entities found in the XML source to their numeric equivalents, e.g. &nbsp; to &#160;, &copy; to &#169; and so on.
The following function will do the job:

  /**
  * Translate literal entities to their numeric equivalents and vice versa.
  *
  * PHP's XML parser (in V 4.1.0) has problems with entities! The only ones that are recognized
  * are &amp;, &lt;, &gt; and &quot;. *ALL* others (like &nbsp;, &copy; and so on) cause an 
  * XML_ERROR_UNDEFINED_ENTITY error. I reported this as bug at http://bugs.php.net/bug.php?id=15092
  * The work around is to translate the entities found in the XML source to their numeric equivalent
  * E.g. &nbsp; to &#160; / &copy; to &#169; a.s.o.
  * 
  * NOTE: Entities &amp;, &lt; &gt; and &quot; are left 'as is'
  * 
  * @author Sam Blum bs_php@users.sourceforge.net
  * @param string $xmlSource The XML string
  * @param bool   $reverse (default=FALSE) Translate numeric entities to literal entities.
  * @return string The XML string with translated entities.
  */
  function _translateLiteral2NumericEntities($xmlSource, $reverse = FALSE) {
    static $literal2NumericEntity;
    
    if (empty($literal2NumericEntity)) {
      $transTbl = get_html_translation_table(HTML_ENTITIES);
      foreach ($transTbl as $char => $entity) {
        if (strpos('&"<>', $char) !== FALSE) continue;
        $literal2NumericEntity[$entity] = '&#'.ord($char).';';
      }
    }
    if ($reverse) {
      return strtr($xmlSource, array_flip($literal2NumericEntity));
    } else {
      return strtr($xmlSource, $literal2NumericEntity);
    }
  }
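
A note for current PHP versions: the function above calls ord() on the keys of the translation table, which only yields the right code when that table is single-byte Latin-1. On PHP 5.4 and later, get_html_translation_table() defaults to UTF-8, so a variant along the following lines may be more robust. This is only a sketch under assumptions: the name namedToNumericEntities() is hypothetical (it is not part of the original report), it only translates in one direction, and it assumes PHP 7.2+ with the mbstring extension for mb_ord().

```php
<?php
/**
 * Hypothetical helper (not from the original report): replace named HTML 4.01
 * entities with numeric character references so an XML parser accepts them.
 * Assumes PHP 7.2+ with mbstring (for mb_ord()).
 */
function namedToNumericEntities(string $xml): string
{
    // The five entities predefined by XML itself are left untouched.
    $predefined = ['amp', 'lt', 'gt', 'quot', 'apos'];

    $result = preg_replace_callback(
        '/&([A-Za-z][A-Za-z0-9]*);/',
        function (array $m) use ($predefined): string {
            if (in_array($m[1], $predefined, true)) {
                return $m[0];
            }
            // Decode the single entity to its UTF-8 character.
            $char = html_entity_decode($m[0], ENT_QUOTES | ENT_HTML401, 'UTF-8');
            if ($char === $m[0]) {
                return $m[0]; // unknown entity name: leave it as-is
            }
            // Emit the Unicode code point as a numeric character reference.
            return '&#' . mb_ord($char, 'UTF-8') . ';';
        },
        $xml
    );

    // preg_replace_callback() returns null on a regex engine error.
    return $result ?? $xml;
}

echo namedToNumericEntities('<AAA>&nbsp;&copy; &amp; &bogus;</AAA>'), "\n";
```

Unknown names such as &bogus; are passed through unchanged, so the XML parser will still report them as undefined rather than having them silently dropped.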




 [2002-01-22 11:34 UTC] chregu@php.net
This doesn't work, because the default entities are only:
<!ENTITY lt     "&#38;#60;"> 
<!ENTITY gt     "&#62;"> 
<!ENTITY amp    "&#38;#38;"> 
<!ENTITY apos   "&#39;"> 
<!ENTITY quot   "&#34;"> 

For the latin1-entities to work, you have to set an external entity to 
http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
(or some local file with that content)
and then set up a xml_set_external_entity_ref_handler().
See the details about that in the manual.

(set to feedback, 'cause I didn't really test it, if someone can verify that, we can close it.)

 [2002-01-22 16:49 UTC] bs_php at infeer dot com
Yes, that's also a possibility. 
  Not practical, 
  not easy, 
  not fancy, 
  not efficient... but a possibility.

---

Still, I think it's a bug:
According to the manual, PHP's XML parser *handles* ISO 8859-1 (Latin-1) by default. That means the XML source may contain literal non-ASCII Latin-1 characters. 
So why should it fail when I intend to use the latin1-entities!? I see no reason why latin1-entities shouldn't work by default too!!
 [2002-01-22 18:08 UTC] chregu@php.net
It does not work because the XML standard (see http://www.w3.org/TR/1998/REC-xml-19980210#sec-predefined-ent) does not say that an XML parser should understand latin1-entities. The expat parser behaves as expected (sablotron and domxml, for example, do the same: they don't understand latin1-entities without external entities). Understanding the iso8859-1 charset is something completely different and has not much to do with entities (in the context of the XML standard).

I agree with you, that it should be easier to include external entities, but the way ext/xml does is also just like a SAX-parser should do it -> do everything by yourself with callback functions... (i assume, that was the idea :) )
 [2002-01-22 18:46 UTC] bs_php at infeer dot com
Don't take the following personally:
I think it's a terrible habit to leave out practical functionality, point to a spec and say "It must be this way, the spec says so". 
I mean, is there a *real* reason not to support the ISO latin-1 entities??
PHP's slogan is to keep simple stuff simple, isn't it?
So is there a chance that the ISO latin-1 entities can be turned on in the future (by a parameter or so)?

 [2002-01-22 19:02 UTC] chregu@php.net
No worries, i'm not taking anything personally here, but nevertheless i have to disagree with you. XML is all about specs... Interoperability is one of the main advantages of XML, and if PHP parsed XML documents with latin1-entities but without a corresponding external entity, it would break the spec and no other XML processor would understand such a document. And i think it'd be the wrong way if PHP parsed non-valid XML documents, even with an optional parameter.. but anyway, i'm not the maintainer of this extension, therefore i don't have the last word on that :)


 [2004-01-20 14:51 UTC] somebody at wolfmarkt dot de
It is not strictly necessary to add a reference to an external entity and to add an external entity handler, a separate parser etc. as described in the other messages above.
Just include the entity definitions you need at the top of your XML and thus make the XML parser aware of them:
<?xml version='1.0'?>
<!DOCTYPE demo SYSTEM "/demo.dtd" [
<!ENTITY nbsp   "&#160;">
<!ENTITY iexcl  "&#161;">
<!ENTITY cent   "&#162;">
<!-- more entities here, see list e.g. at http://www.w3.org/MarkUp/html-spec/html-spec_14.html -->
<!ENTITY yuml    "&#255;">

]>
<demo>
<!-- your XML doc here -->
</demo>
 