Bug #60021 DOMDocument errors on HTML5 tags
Submitted: 2011-10-09 05:24 UTC Modified: 2012-04-02 04:13 UTC
Votes:52
Avg. Score:4.7 ± 0.6
Reproduced:48 of 48 (100.0%)
Same Version:16 (33.3%)
Same OS:14 (29.2%)
From: drgroove at gmail dot com Assigned:
Status: Suspended Package: DOM XML related
PHP Version: 5.3.8 OS: Mac OS X
Private report: No CVE-ID: None

 [2011-10-09 05:24 UTC] drgroove at gmail dot com
Description:
------------
Loading an HTML document with DOMDocument::loadHTMLFile() emits the following warning when the file contains certain new HTML5 tags:

Warning: DOMDocument::loadHTMLFile() [domdocument.loadhtmlfile]: Tag footer invalid in {file path here}

<footer> is a new HTML5 tag. The same warning appears for other HTML5 tags as well (e.g., <header>).



Test script:
---------------
// TEST.html
<header>
     Some text here
</header>

// TEST.php
<?php
$dom_document = new DOMDocument();
$dom_document->loadHTMLFile("TEST.html");
?>



Expected result:
----------------
DOMDocument should load documents containing HTML5 tags without emitting warnings.

Actual result:
--------------
Warning: DOMDocument::loadHTMLFile() [domdocument.loadhtmlfile]: Tag header invalid in {file path here}


History

 [2012-04-02 03:41 UTC] drgroove at gmail dot com
Any progress on resolving this?  Working w/ DOMDocument and HTML5 is a huge pain in the butt right now; you have to write custom error handlers for things like <header/>, <nav/>, and other HTML5 tags.  

Also, just entered a bug report for SimpleXML (where tags w/ both attributes and text have their attributes dropped).  Both DOMDocument and SimpleXML need updates... it's very difficult to work w/ HTML and XML when both of these APIs have so many issues. 

Thanks for your help everyone :)
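
For anyone landing here, one way to avoid writing a custom error handler is to let libxml buffer its diagnostics and then discard only the unknown-tag ones. This is a minimal sketch, not an official fix. It assumes PHP 7.1 or later, and it assumes that libxml2 reports "Tag ... invalid" with error code 801 (XML_HTML_UNKNOWN_TAG); check that against your libxml2 build. The function name loadHtmlLenient is made up for illustration.

```php
<?php
// Hypothetical helper (not part of PHP): parse HTML while collecting libxml
// diagnostics in a buffer instead of letting them surface as PHP warnings.
function loadHtmlLenient(string $html): array
{
    // Assumption: 801 is libxml2's XML_HTML_UNKNOWN_TAG ("Tag ... invalid").
    $unknownTagCode = 801;

    // Buffer libxml errors; remember the caller's previous setting.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // Keep every diagnostic except the unknown-tag ones.
    $otherErrors = [];
    foreach (libxml_get_errors() as $error) {
        if ($error->code !== $unknownTagCode) {
            $otherErrors[] = $error;
        }
    }

    // Empty the buffer and restore the caller's setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return [$dom, $otherErrors];
}

[$dom, $errors] = loadHtmlLenient('<header>Some text here</header>');
echo $dom->getElementsByTagName('header')->length, PHP_EOL;
```

This only hides the warning. The document is still parsed with HTML 4.01 rules, so anything that depends on HTML5 parsing behaviour is not fixed by it.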
 [2012-04-02 04:13 UTC] aharvey@php.net
-Status: Open
+Status: Suspended
 [2012-04-02 04:13 UTC] aharvey@php.net
It's a valid issue, but it's really an upstream one: libxml2's HTML parser only 
supports HTML 4.01, so until that's extended to support HTML5 or a new parser is 
added to libxml2, there's little to be done in PHP proper.

There are userspace parsers available: html5lib will parse documents according 
to the HTML5 algorithm and give you a DOMDocument to work with.

Suspending for now. Given that the issue was first raised upstream in 2008, I 
wouldn't hold your breath (although I suspect they'd love a patch).
 [2016-02-04 07:45 UTC] cweiske@php.net
There wasn't even an HTML5 tag bug report for libxml2 yet; I've created one now:
 https://bugzilla.gnome.org/show_bug.cgi?id=761534
 [2020-01-31 13:56 UTC] matthewheroux at gmail dot com
HTML5 has been the standard for years. This is negatively impacting PHP, and it is a huge issue for keeping PHP relevant and modern. It seems like such an achievable fix: a few entities, a different doc tag.
 
Last updated: Fri Jul 03 18:01:26 2020 UTC