Bug #44367 DOMDocument::baseURI parsing is out of whack
Submitted: 2008-03-08 05:09 UTC
Modified: 2008-03-12 22:30 UTC
From: daniel dot oconnor at gmail dot com
Assigned: rrichards
Status: Not a bug
Package: DOM XML related
PHP Version: 5.2.5
OS: Windows
Private report: No
CVE-ID: None
 [2008-03-08 05:09 UTC] daniel dot oconnor at gmail dot com
Description:
------------
The W3C clarified a few xml:base issues when publishing the GRDDL spec.

You can see the tests at http://www.w3.org/TR/grddl-tests/#ambiguous-infoset.

Basically:
 * DOMDocument::loadXML does not detect xml:base attributes
 * simplexml_load_file does not detect xml:base attributes (or they are lost during the importNode phase)
 * simplexml_load_string does not detect xml:base attributes (or they are lost during the importNode phase)
 * DOMDocument does not deal with nested xml:base
 * DOMDocument does not deal with redirected xml:base locations

To clarify on the redirect-xml:base stuff...

If I request http://foo.com/example.xml
and that redirects me to http://bar.com/example.xml
and bar.com/example.xml said xml:base = http://foo.com/example.xml

... then http://bar.com/example.xml's baseURI should be http://bar.com/example.xml

Reproduce code:
---------------
<?php
$url = 'http://www.w3.org/2001/sw/grddl-wg/td/base/xmlWithBase.xml';
$xml = file_get_contents($url);

//Load a url
$doc = DOMDocument::load($url);
var_dump($doc->baseURI);    //Expected http://www.w3.org/2001/sw/grddl-wg/td/base/xmlWithBase.xml

//Load an xml document with xml:base
$doc = DOMDocument::loadXML($xml);
var_dump($doc->baseURI);    //Expected http://www.w3.org/2001/sw/grddl-wg/td/base/xmlWithBase.xml



//Does it work with importNode?
$sxe = simplexml_load_file($url);
$dom_sxe = dom_import_simplexml($sxe);

$dom = new DOMDocument('1.0');
$dom_sxe = $dom->importNode($dom_sxe, true);
$dom_sxe = $dom->appendChild($dom_sxe);
var_dump($dom->baseURI);    //Expected (maybe) http://www.w3.org/2001/sw/grddl-wg/td/base/xmlWithBase.xml

// Alternative?
$sxe = simplexml_load_string($xml);
$dom_sxe = dom_import_simplexml($sxe);

$dom = new DOMDocument('1.0');
$dom_sxe = $dom->importNode($dom_sxe, true);
$dom_sxe = $dom->appendChild($dom_sxe);
var_dump($dom->baseURI);   //Expected (maybe) http://www.w3.org/2001/sw/grddl-wg/td/base/xmlWithBase.xml



//What about documents with an invalid xml:base (not on the top level element)?
$doc = DOMDocument::load('http://www.w3.org/2001/sw/grddl-wg/td/inline-rdf6.xml');
var_dump($doc->baseURI);    //Expected http://wwww.example.org/

//What about documents with a *redirected xml:base* ?
//Note: this test case is a little broken because of a W3C server change - it *should* redirect to 'http://www.w3.org/2001/sw/grddl-wg/td/base/xmlWithBase.xml'
//      and thus have a funky new xml:base value
$doc = DOMDocument::load('http://www.w3.org/2001/sw/grddl-wg/td/xmlWithBase.xml');
var_dump($doc->baseURI);    //Expected http://www.w3.org/2001/sw/grddl-wg/td/base/xmlWithBase.xml

Expected result:
----------------
See reproduce code

Actual result:
--------------
See reproduce code

 [2008-03-08 22:20 UTC] johannes@php.net
Rob, please take a look
 [2008-03-10 14:09 UTC] rrichards@php.net
Thank you for taking the time to write to us, but this is not
a bug. Please double-check the documentation available at
http://www.php.net/manual/ and the instructions on how to report
a bug at http://bugs.php.net/how-to-report.php

I don't know about GRDDL, but for DOM trees, the base URI of a DOMDocument is
the URI it was loaded from (or, for a memory-based tree, the current directory).
You need to check the document element to get the base URI you are
looking for.
 [2008-03-11 00:03 UTC] daniel dot oconnor at gmail dot com
See http://www.w3.org/TR/grddl/#base_misc & http://www.apps.ietf.org/rfc/rfc3986.html#sec-5.1

The way to determine baseURI is:
 1. Look for it on the root document element (HTML - <base>, XML - <foo xml:base="">)
 2. Couldn't find that? Use the URL we retrieved the document with
     * And make sure we follow redirects!
 3. Couldn't find that? Application specific (but we don't really have a setBaseURI())
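The ordering above can be sketched as a small helper. This is only an illustration of the precedence described in this comment, not how PHP's DOM extension behaves; the function name pickBaseUri and its three parameters are hypothetical, and resolving a relative xml:base against the retrieval URL is deliberately left out.

```php
<?php
// Hypothetical helper: returns the first available base URI, in the order
// listed above. None of these names exist in PHP's DOM extension.
function pickBaseUri(?string $embeddedBase, ?string $retrievalUrl, ?string $appDefault): ?string
{
    foreach ([$embeddedBase, $retrievalUrl, $appDefault] as $candidate) {
        if ($candidate !== null && $candidate !== '') {
            return $candidate;
        }
    }
    return null; // nothing available at any level
}

// An xml:base found on the root element wins over the retrieval URL.
var_dump(pickBaseUri('http://example.org/base/', 'http://example.org/doc.xml', null));

// No embedded base: fall back to the (post-redirect) retrieval URL.
var_dump(pickBaseUri(null, 'http://example.org/doc.xml', null));
```

The caller is assumed to pass the URL obtained after following redirects as the second argument.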

So, condition #1 is broken in 5.2.5 when you do:

<?php
$doc =
DOMDocument::load('http://www.w3.org/2001/sw/grddl-wg/td/inline-rdf6.xml');
var_dump($doc->baseURI);    //Expected http://wwww.example.org/

produces:
string(53) "http://www.w3.org/2001/sw/grddl-wg/td/inline-rdf6.xml"
 [2008-03-12 17:16 UTC] rrichards@php.net
Still bogus: what you are describing pertains only to GRDDL, not DOM,
so when working with GRDDL and DOM you need to check the base URI of the
document element, not the DOMDocument.
DOM determines the base URI using the XML Base spec.

"The base URI of a document entity or an external entity is determined 
by RFC 2396 rules, namely, that the base URI is the URI used to retrieve 
the document entity or external entity."

This is not just how it is implemented in PHP as the other major DOM 
parsers implement it the same way.
 [2008-03-12 22:30 UTC] daniel dot oconnor at gmail dot com
:S I hate being pushy / argumentative, sorry if it's coming across that way.


RFC 2396 is "Uniform Resource Identifiers (URI): Generic Syntax"

Section 5.1, "Establishing a Base URI", describes what I've been trying to say, probably a little more clearly.



XML Base spec @ http://www.w3.org/TR/xmlbase/#rfc2396 says:

Determine a baseURI:
 1. The base URI is embedded in the document's content.
 2. The base URI is that of the encapsulating entity (message, document, or none).
 3. The base URI is the URI used to retrieve the entity.
 4. The base URI is defined by the context of the application.




> This is not just how it is implemented in PHP as the other major DOM parsers implement it the same way

... and that's why the xml:base GRDDL tests were written - to clarify correct behaviour / check implementations.
 [2013-02-15 09:52 UTC] sites at hubmed dot org
A test case which illustrates that the baseURI parsing is working correctly now (at least in PHP 5.3.15):

<?php
$doc = DOMDocument::load('http://www.w3.org/2001/sw/grddl-wg/td/inline-rdf6.xml');

var_dump($doc->baseURI); // "http://www.w3.org/2001/sw/grddl-wg/td/inline-rdf6.xml"

var_dump($doc->documentElement->baseURI); // "http://wwww.example.org/"

As http://www.w3.org/TR/xmlbase/ describes, the base URI of a document entity is the URI used to retrieve the document entity. The base URI of an element (including the document element) is determined by various rules, starting with the xml:base attribute on the element.
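
The same distinction can be reproduced without network access. The following is a minimal sketch, assuming a made-up in-memory document and the example URI http://example.org/base/ (neither comes from the GRDDL test suite). It calls loadXML() on an instance rather than statically, since PHP 8 throws an Error on static calls to these methods.

```php
<?php
// Made-up document: xml:base on the root element, a relative xml:base on a child.
$xml = '<root xml:base="http://example.org/base/">'
     . '<child xml:base="sub/"><leaf/></child>'
     . '</root>';

$doc = new DOMDocument();
$doc->loadXML($xml);

$root  = $doc->documentElement;
$child = $root->firstChild;
$leaf  = $child->firstChild;

// The document node reports where the document came from; for a string
// loaded from memory this is not the xml:base value.
var_dump($doc->baseURI);

// Elements take xml:base into account.
var_dump($root->baseURI);   // the xml:base declared on the root element
var_dump($child->baseURI);  // relative xml:base, resolved against the parent's base
var_dump($leaf->baseURI);   // no xml:base of its own, so inherited from <child>
```

Reading baseURI from the element of interest, rather than from the DOMDocument, is what picks up xml:base here.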
 [2013-07-14 12:53 UTC] hanskrentel at yahoo dot de
Please note that PHP's DOMDocument does not implement DOM Core Level 3
at all. Whatever the DOM Core Level 3 specs say, nothing - absolutely
nothing - allows one to conclude that this property (which perhaps shares
the name only by accident) is an implementation of DOM Core Level 3.

PHP's DOMDocument implements only DOM Core Level 1, which does not cover this
property.

All references to XML Infoset in this ticket are therefore completely bogus.
 [2013-07-31 05:56 UTC] mike at skew dot org
I submitted a couple of related feature requests:

Request #65364 - In doc not loaded from a URL, baseURI should still be a real URI
Request #65365 - Allow defining baseURI of doc not loaded directly from URL
Last updated: Thu Nov 21 19:01:29 2024 UTC