Bug #38538 loadHTML doesn't use XHTML namespace
Submitted: 2006-08-21 23:25 UTC Modified: 2006-08-22 09:34 UTC
Votes:4
Avg. Score:4.8 ± 0.4
Reproduced:4 of 4 (100.0%)
Same Version:0 (0.0%)
Same OS:2 (50.0%)
From: spam02 at pornel dot net Assigned:
Status: Wont fix Package: DOM XML related
PHP Version: 6CVS-2006-08-21 (snap) OS: *
Private report: No CVE-ID: None
 [2006-08-21 23:25 UTC] spam02 at pornel dot net
Description:
------------
From W3C: XHTML/1.0 is a reformulation of HTML 4 in XML. The semantics of HTML and XHTML elements are identical.

loadHTML() should put loaded elements in the XHTML namespace to preserve their semantics. These aren't just any random elements - these are HTML elements, and HTML elements in XML (and therefore in DOM) are in the "http://www.w3.org/1999/xhtml" namespace.

This isn't a purely academic problem.

It's difficult to handle both HTML and XHTML uniformly using DOM in PHP - the difference in namespaces causes XPath/XSLT to behave differently.

AFAIK there's no trivial method of changing the namespace of all document elements, so the namespace returned by loadHTML() is quite important.
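
To illustrate why this is non-trivial: a minimal sketch of one possible approach, which rebuilds the tree element by element with createElementNS(). The helper names copyIntoXhtmlNamespace() and toXhtmlNamespace() are hypothetical and not part of the DOM extension, and the sketch assumes plain, unprefixed attributes.

```php
<?php
const XHTML_NS = 'http://www.w3.org/1999/xhtml';

// Hypothetical helper: copies $node into $target, recreating every
// element in the XHTML namespace. Assumes unprefixed attributes.
function copyIntoXhtmlNamespace(DOMNode $node, DOMDocument $target): DOMNode
{
    if ($node instanceof DOMElement) {
        $copy = $target->createElementNS(XHTML_NS, $node->localName);
        foreach ($node->attributes as $attr) {
            $copy->setAttribute($attr->nodeName, $attr->nodeValue);
        }
        foreach ($node->childNodes as $child) {
            $copy->appendChild(copyIntoXhtmlNamespace($child, $target));
        }
        return $copy;
    }
    // Text, comments and other non-element nodes are imported unchanged.
    return $target->importNode($node, true);
}

// Hypothetical helper: returns a new document whose elements are all
// in the XHTML namespace; the source document is left untouched.
function toXhtmlNamespace(DOMDocument $source): DOMDocument
{
    $target = new DOMDocument('1.0', 'UTF-8');
    $root = copyIntoXhtmlNamespace($source->documentElement, $target);
    $target->appendChild($root);
    return $target;
}

$html = new DOMDocument();
$html->loadHTML('<html><body class="a">hello</body></html>');
$converted = toXhtmlNamespace($html);

$xpath = new DOMXPath($converted);
$xpath->registerNamespace('x', XHTML_NS);
echo $xpath->evaluate('string(//x:body)'), "\n"; // prints "hello"
```

This copies the whole document, so it costs time and memory proportional to the document size, which is the overhead a built-in option would avoid.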


SUGGESTED CHANGE
Simply putting elements in a namespace will break backwards compatibility a little (XPath queries, for example). Therefore I suggest adding an optional boolean argument to loadHTML() and loadHTMLFile() that enables the new behavior.

Reproduce code:
---------------
<?php
$html = new DOMDocument();
$html->loadHTML('<html><body>hello');
$xhtml = new DOMDocument();
$xhtml->loadXML('<html xmlns="http://www.w3.org/1999/xhtml"><body>hello</body></html>');

// Print the text content of <body>, matched in the XHTML namespace.
function test($doc)
{
    $x = new DOMXPath($doc);
    $x->registerNamespace("x", "http://www.w3.org/1999/xhtml");
    echo $x->evaluate("string(//x:body)");
}

test($html);  // prints nothing: elements from loadHTML() have no namespace
test($xhtml); // prints "hello"

// local-name() could be used as a workaround in this particular test case; however, this isn't possible/feasible in every case.
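
For reference, a minimal sketch of the local-name() workaround mentioned in the comment above. The function name bodyText() is hypothetical. It matches elements by local name only, so it ignores the namespace entirely and would also match a "body" element from any unrelated namespace.

```php
<?php
// Hypothetical helper: reads the text of <body> regardless of namespace.
function bodyText(DOMDocument $doc): string
{
    $xpath = new DOMXPath($doc);
    return $xpath->evaluate("string(//*[local-name() = 'body'])");
}

$html = new DOMDocument();
$html->loadHTML('<html><body>hello</body></html>');

$xhtml = new DOMDocument();
$xhtml->loadXML('<html xmlns="http://www.w3.org/1999/xhtml"><body>hello</body></html>');

echo bodyText($html), bodyText($xhtml), "\n"; // prints "hellohello"
```

The same trick is not available where the query is fixed or matched by qualified name, for example in existing XSLT stylesheets written against the XHTML namespace.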


Expected result:
----------------
hellohello

Actual result:
--------------
hello


History

 [2006-08-22 05:41 UTC] chregu@php.net
loadHTML() only properly deals with HTML 4 documents, which are
by definition not namespace-aware, and therefore discards any
namespaces.

If you want to keep the namespaces, use loadXML() or, for your 
proposal, use the tidy extension to make XHTML out of your 
HTML documents.
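
A minimal sketch of the tidy route suggested above, assuming the tidy extension is installed. The helper name loadHtmlAsXhtml() is hypothetical. It passes the markup through tidy as a string with the output-xhtml and numeric-entities options and then parses the result with loadXML(), so no tidy node objects are handed to DOM.

```php
<?php
// Hypothetical helper: converts HTML to XHTML text with tidy, then parses
// it as XML. Returns null if tidy is unavailable or the conversion fails.
function loadHtmlAsXhtml(string $html): ?DOMDocument
{
    if (!function_exists('tidy_repair_string')) {
        return null; // tidy extension not loaded
    }
    $xml = tidy_repair_string($html, [
        'output-xhtml'     => true, // emit XHTML with the xmlns declaration
        'numeric-entities' => true, // avoid named entities unknown to XML
    ], 'utf8');
    if (!is_string($xml) || $xml === '') {
        return null;
    }
    $doc = new DOMDocument();
    return $doc->loadXML($xml) ? $doc : null;
}

$doc = loadHtmlAsXhtml('<html><body>hello');
if ($doc !== null) {
    $xpath = new DOMXPath($doc);
    $xpath->registerNamespace('x', 'http://www.w3.org/1999/xhtml');
    echo trim($xpath->evaluate('string(//x:body)')), "\n";
}
```

The round trip through a string means the document is parsed twice, and tidy may add elements such as an empty title, so the resulting tree is not guaranteed to match what loadHTML() would produce.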


 [2006-08-22 09:34 UTC] spam02 at pornel dot net
I'm not saying that loadHTML() should read the namespace from its input - of course HTML/SGML syntax doesn't support it.

But by loading HTML into an XML DOM you're basically converting it to a namespace-aware representation. Being HTML is implied by the source format, and not explicitly stated in the document, so the namespace information needs to be added.

Namespaces are used to distinguish incompatible nodes in DOM; however, the DOM representations of XHTML and HTML are 100% compatible (same semantics, same structure).

Tidy nodes aren't compatible with PHP DOM extension, so this is not a solution.
 