Bug #63189 External DTDs are not processed
Submitted: 2012-09-29 19:56 UTC Modified: 2012-10-16 17:59 UTC
From: vl dot homutov at gmail dot com Assigned:
Status: Not a bug Package: *XML functions
PHP Version: 5.4.7 OS: Linux
Private report: No CVE-ID: None
 [2012-09-29 19:56 UTC] vl dot homutov at gmail dot com
Description:
------------
PHP's xml_parse() ignores an external DTD specified in the
XML file and therefore can't parse the file if it references
entities that are defined only in that DTD.


Test script:
---------------
#!/usr/bin/php
<?php

$xml_ext_dtd=<<<EOXML
<?xml version="1.0"?>
<!DOCTYPE mytag SYSTEM "./mytag.dtd">
<mytag><elem>one</elem><elem>two</elem><elem>&custom;</elem></mytag>
EOXML;

$xml_int_dtd=<<<EOXML
<?xml version="1.0"?>
<!DOCTYPE mytag
[
<!ENTITY custom SYSTEM "file.txt">
]>
<mytag><elem>one</elem><elem>two</elem><elem>&custom;</elem></mytag>
EOXML;

function externalEntityHandler($parser, $name, $base, $systemId, $publicId)
{
	echo "PROCESS EXTERNAL REFERENCE(file=$systemId)\n";
	return true;
}

function characterDataHandler($parser, $data)
{
	echo "CDATA found: '$data'\n";
}

function xerr($parser)
{
	$out = "XML parser error:";
	$out.=xml_error_string(xml_get_error_code($parser));
	$out.="\n";
	return $out;
}

echo "This works OK - parse xml1:\n$xml_int_dtd\n";
echo "---------------------------------------\n";
$xml_parser = xml_parser_create();
xml_set_character_data_handler($xml_parser, "characterDataHandler");
xml_set_external_entity_ref_handler($xml_parser, "externalEntityHandler");
xml_parse($xml_parser, $xml_int_dtd) or die(xerr($xml_parser));

echo "\nThis FAILS - parse xml2:\n$xml_ext_dtd\n";
echo "---------------------------------------\n";
$xml_parser = xml_parser_create();
xml_set_character_data_handler($xml_parser, "characterDataHandler");
xml_set_external_entity_ref_handler($xml_parser, "externalEntityHandler");
$rv = xml_parse($xml_parser, $xml_ext_dtd);
if (!$rv) echo xerr($xml_parser);

echo "file 'mytag.dtd' is:\n".file_get_contents("./mytag.dtd");

?>

Expected result:
----------------
This works OK - parse xml1:
<?xml version="1.0"?>
<!DOCTYPE mytag
[
<!ENTITY custom SYSTEM "file.txt">
]>
<mytag><elem>one</elem><elem>two</elem><elem>&custom;</elem></mytag>
---------------------------------------
CDATA found: 'one'
CDATA found: 'two'
PROCESS EXTERNAL REFERENCE(file=file.txt)


Actual result:
--------------
This FAILS - parse xml2:
<?xml version="1.0"?>
<!DOCTYPE mytag SYSTEM "./mytag.dtd">
<mytag><elem>one</elem><elem>two</elem><elem>&custom;</elem></mytag>
---------------------------------------
CDATA found: 'one'
CDATA found: 'two'
XML parser error:Undeclared entity warning
file 'mytag.dtd' is:
<!ENTITY custom SYSTEM "file.txt">


 [2012-09-30 21:04 UTC] vl dot homutov at gmail dot com
Additional details:

There is also a problem if the custom entity appears in an attribute:

<?xml version="1.0"?>
<!DOCTYPE mytag [<!ENTITY custom SYSTEM "file.txt">]>
<mytag attr="&custom;"><elem>one</elem><elem>two</elem><elem>&custom;</elem></mytag>

gives: XML parser error:XML_ERR_ENTITY_IS_EXTERNAL
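
For context, the XML 1.0 specification forbids references to external entities inside attribute values (well-formedness constraint "No External Entity References"), so this particular error is expected from a conforming parser. The sketch below contrasts an internal and an external entity in an attribute. It uses DOMDocument rather than xml_parse(), and the replacement text "three" is an assumption made for illustration.

```php
<?php
// Sketch: entity references inside attribute values.
libxml_use_internal_errors(true); // collect parse errors instead of emitting warnings

// Internal entity (replacement text "three" is made up): allowed in an attribute.
$internal = <<<'XML'
<?xml version="1.0"?>
<!DOCTYPE mytag [<!ENTITY custom "three">]>
<mytag attr="&custom;"/>
XML;

$doc = new DOMDocument();
$internalOk = $doc->loadXML($internal);
$attrValue = $internalOk ? $doc->documentElement->getAttribute('attr') : null;

// External entity: not allowed in an attribute value, so parsing fails.
$external = <<<'XML'
<?xml version="1.0"?>
<!DOCTYPE mytag [<!ENTITY custom SYSTEM "file.txt">]>
<mytag attr="&custom;"/>
XML;

$doc2 = new DOMDocument();
$externalOk = $doc2->loadXML($external);

// Print whatever libxml reported for the second document.
foreach (libxml_get_errors() as $error) {
    echo trim($error->message), "\n";
}
libxml_clear_errors();

var_dump($attrValue);
var_dump($externalOk);
```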
 [2012-10-16 15:46 UTC] cataphract@php.net
-Status: Open +Status: Not a bug
 [2012-10-16 15:46 UTC] cataphract@php.net
This is not a bug; the external subset is handled separately in libxml2.

See http://lxr.php.net/xref/THIRD_PARTY/libxml2/parser.c#xmlParseDocTypeDecl
This is where the doctype is parsed. The external DTD triggers a call to the
"internalSubset" callback, which we do not hook internally and which is
therefore not hookable in userland either. See
http://lxr.php.net/xref/PHP_TRUNK/ext/xml/compat.c#php_xml_compat_handlers

The internal subset is processed elsewhere, in xmlParseInternalSubset(), and
does not depend on a SAX callback.

More generally, see http://xmlsoft.org/entities.html :

> WARNING: handling entities on top of the libxml2 SAX interface is difficult!!! 
If you plan to use non-predefined entities in your documents, then the learning 
curve to handle then using the SAX API may be long. If you plan to use complex 
documents, I strongly suggest you consider using the DOM interface instead and 
let libxml deal with the complexity rather than trying to do it yourself.
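
As an illustration of the DOM route suggested above, here is a minimal sketch that loads a document with an external DTD through DOMDocument. The file names mirror the report; the temporary directory and the content of file.txt ("three") are assumptions. LIBXML_DTDLOAD asks libxml2 to load the external subset and LIBXML_NOENT asks it to substitute entities. Enabling these flags on untrusted input exposes the application to external entity (XXE) attacks, so treat this as a sketch for trusted files only.

```php
<?php
// Sketch only: set up the files from the report in a scratch directory (assumed location).
$dir = sys_get_temp_dir() . '/bug63189_' . uniqid();
mkdir($dir);
file_put_contents("$dir/file.txt", 'three'); // assumed entity content
file_put_contents("$dir/mytag.dtd", '<!ENTITY custom SYSTEM "file.txt">' . "\n");

$xml = <<<'XML'
<?xml version="1.0"?>
<!DOCTYPE mytag SYSTEM "mytag.dtd">
<mytag><elem>one</elem><elem>two</elem><elem>&custom;</elem></mytag>
XML;
file_put_contents("$dir/doc.xml", $xml);

// Loading from a file gives libxml2 a base URI for resolving "mytag.dtd" and "file.txt".
$doc = new DOMDocument();
$loaded = $doc->load("$dir/doc.xml", LIBXML_DTDLOAD | LIBXML_NOENT);

$texts = [];
if ($loaded) {
    foreach ($doc->getElementsByTagName('elem') as $elem) {
        $texts[] = $elem->textContent;
    }
}
print_r($texts);

// Clean up the scratch files.
unlink("$dir/doc.xml");
unlink("$dir/mytag.dtd");
unlink("$dir/file.txt");
rmdir($dir);
```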
 [2012-10-16 17:59 UTC] vl dot homutov at gmail dot com
It would be nice to document
all the limitations that this parser
has.
It's a pity to find out that, despite the existence of
xml_set_external_entity_ref_handler(), it doesn't work
as expected.

Another issue I've found recently is that newlines are
not preserved in attributes.

i.e. <some_tag id="Some aligned
                   long text here
                   like long title"
               more="more_attrs">

will be read as attrs[id]="Some aligned        long text here       like..."
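
This matches attribute-value normalization as defined by XML 1.0: a conforming parser replaces each literal line break, tab, or carriage return in an attribute value with a space. A line break written as the character reference &#10; is not affected. A minimal sketch of the difference follows; the attribute values are shortened and the "more" value is changed from the example above for illustration.

```php
<?php
// Sketch: literal newline vs. character reference in attribute values.
$xml = "<some_tag id=\"Some aligned\n   long text\" more=\"line one&#10;line two\"/>";

$attrs = [];
$parser = xml_parser_create();
// Keep attribute names as written instead of upper-casing them.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
xml_set_element_handler(
    $parser,
    function ($parser, $name, $attributes) use (&$attrs) {
        $attrs = $attributes; // remember the attributes of the (only) element
    },
    function ($parser, $name) {
        // end-element handler: nothing to do
    }
);
$ok = xml_parse($parser, $xml, true);

var_dump($attrs['id']);   // literal newline has been normalized to a space
var_dump($attrs['more']); // &#10; survives as a real newline
```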

The workaround is to parse the external DTD reference manually
and inject the contents of the file it points to into the document being parsed,
so that all entities get loaded. Ugly, yes, but it works for me.
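
One possible shape for that workaround is sketched below. The helper name inlineExternalDtd() and the temporary directory are hypothetical, and the regular expression only handles the simple `<!DOCTYPE name SYSTEM "file">` form with no existing internal subset. DTD files that rely on constructs not allowed in an internal subset (for example, conditional sections) would not survive this rewrite.

```php
<?php
/**
 * Hypothetical helper: replace <!DOCTYPE root SYSTEM "x.dtd"> with a DOCTYPE
 * whose internal subset contains the declarations from that DTD file.
 */
function inlineExternalDtd(string $xml, string $baseDir): string
{
    $pattern = '/<!DOCTYPE\s+([^\s\[>]+)\s+SYSTEM\s+(["\'])(.*?)\2\s*>/s';

    $result = preg_replace_callback($pattern, function (array $m) use ($baseDir) {
        $path = $baseDir . '/' . $m[3];
        if (!is_readable($path)) {
            return $m[0]; // leave the declaration untouched if the DTD is unreadable
        }
        $dtd = trim(file_get_contents($path));
        return "<!DOCTYPE {$m[1]} [\n{$dtd}\n]>";
    }, $xml, 1);

    return $result ?? $xml; // preg_replace_callback() returns null on a regex error
}

// Demo with the DTD from the report, written to an assumed scratch directory.
$dir = sys_get_temp_dir() . '/bug63189_inline_' . uniqid();
mkdir($dir);
file_put_contents("$dir/mytag.dtd", '<!ENTITY custom SYSTEM "file.txt">' . "\n");

$xml = <<<'XML'
<?xml version="1.0"?>
<!DOCTYPE mytag SYSTEM "./mytag.dtd">
<mytag><elem>one</elem><elem>two</elem><elem>&custom;</elem></mytag>
XML;

$inlined = inlineExternalDtd($xml, $dir);
echo $inlined, "\n";

unlink("$dir/mytag.dtd");
rmdir($dir);
```

The rewritten string can then be passed to xml_parse() in place of the original document, which is the case the test script above reports as working.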