Bug #66712 Using LIBXML_HTML_NOIMPLIED on DomDocument::loadHTML() gives unexpected results
Submitted: 2014-02-14 01:23 UTC Modified: 2023-09-07 18:46 UTC
Votes:8
Avg. Score:4.1 ± 0.9
Reproduced:7 of 7 (100.0%)
Same Version:1 (14.3%)
Same OS:1 (14.3%)
From: chanson at mesd dot k12 dot or dot us Assigned: nielsdos (profile)
Status: Closed Package: DOM XML related
PHP Version: 5.5.9 OS: Fedora 20 x86/64
Private report: No CVE-ID: None
 [2014-02-14 01:23 UTC] chanson at mesd dot k12 dot or dot us
Description:
------------
Using the LIBXML_HTML_NOIMPLIED predefined constant in the DomDocument class has unexpected results.

When (and only when) the optional LIBXML_HTML_NOIMPLIED constant is passed to the loadHTML() method, the nodeValue of the first item in a DOMNodeList contains the values of every item in the list.

I am currently running:
- PHP Version 5.5.8
- libxml Version 2.9.1

Test script:
---------------
<?php
$html = '<h1>Foo</h1><h2>Bar</h2><p>lorem ipsum</p>';
$dom = new \DOMDocument();
// LIBXML_HTML_NOIMPLIED stops the parser from adding implied <html>/<body> elements.
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED);
$nodes = $dom->getElementsByTagName('*');
// Print the first element on its own, then every element in the list.
echo $nodes->item(0)->tagName . ' -> ' . $nodes->item(0)->nodeValue . '<br/>';
foreach ($nodes as $node) {
    echo $node->tagName . ' -> ' . $node->nodeValue . '<br/>';
}

Expected result:
----------------
h1 -> FooBarlorem ipsum
h1 -> FooBarlorem ipsum
h2 -> Bar
p -> lorem ipsum

Actual result:
--------------
h1 -> Foo
h1 -> Foo
h2 -> Bar
p -> lorem ipsum

History

 [2015-04-15 14:41 UTC] cmb@php.net
-Summary: Using LIBXML_NOHTML_IMPLIED on DomDocument::loadHTML() gives unexpected results
+Summary: Using LIBXML_HTML_NOIMPLIED on DomDocument::loadHTML() gives unexpected results
-Status: Open
+Status: Feedback
-Assigned To:
+Assigned To: cmb
 [2015-04-15 14:41 UTC] cmb@php.net
I am not able to reproduce this behavior, see
<http://3v4l.org/SFlNo>. Can you confirm?
 [2015-04-15 14:46 UTC] cmb@php.net
-Status: Feedback
+Status: Verified
-Assigned To: cmb
+Assigned To:
 [2015-04-15 14:46 UTC] cmb@php.net
Well, of course the behavior is reproducible -- only the expected
and actual behavior sections in the report are mixed up.
 [2015-11-09 14:47 UTC] dlundgren at syberisle dot net
This may be more of a documentation issue than a code issue. After encountering the same problem recently, I found that the first element becomes the root element of the document. To offset this I wrapped the html fragment in a root element, and I was able to work with it that way.

I suspect the confusion comes from not knowing that DOM Level 2 requires a document to have a single documentElement, with all other elements as its descendants. I had to look that up to understand what I was doing wrong.
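
For anyone landing here, the wrapper workaround described above might look roughly like the sketch below. The `<div>` wrapper and the `$values` array are illustrative choices, not something taken from the original comment, and LIBXML_HTML_NODEFDTD is added only to keep the doctype out of the document.

```php
<?php
// Sketch of the "wrap the fragment in a root element" workaround.
// The <div> wrapper is an arbitrary choice; any single container element should do.
$html = '<h1>Foo</h1><h2>Bar</h2><p>lorem ipsum</p>';

$dom = new DOMDocument();
// With the wrapper in place the parser sees exactly one top-level element.
$dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// The wrapper becomes the documentElement; the fragment's elements are its children.
$values = [];
foreach ($dom->documentElement->childNodes as $child) {
    if ($child instanceof DOMElement) {
        $values[$child->tagName] = $child->nodeValue;
    }
}

foreach ($values as $tag => $value) {
    echo $tag . ' -> ' . $value . "\n";
}
```

With the wrapper, each element should report only its own text (h1 -> Foo, h2 -> Bar, p -> lorem ipsum), because the fragment's elements are siblings under the wrapper rather than nested under the first one.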
 [2023-06-08 20:45 UTC] nielsdos@php.net
-Type: Bug
+Type: Documentation Problem
 [2023-06-08 20:45 UTC] nielsdos@php.net
Agreed that this is a documentation issue. The document element indeed becomes the <h1> tag and everything becomes nested under it: <h1>Foo<h2>Bar</h2><p>lorem ipsum</p></h1>
I think this is correct behaviour.
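
A quick way to see the tree the parser actually built is to serialize it back out. This is only a sketch; what it prints depends on the libxml2 version PHP is linked against, so no output is shown here.

```php
<?php
// Inspect the structure produced by LIBXML_HTML_NOIMPLIED for a multi-element fragment.
$html = '<h1>Foo</h1><h2>Bar</h2><p>lorem ipsum</p>';

$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED);

// documentElement is the first element at the top of the document.
$root = $dom->documentElement;
echo 'documentElement: ' . $root->tagName . "\n";

// Serializing the document shows whether <h2> and <p> ended up inside <h1>.
$serialized = $dom->saveHTML();
echo $serialized;
```

If the serialized markup shows `<h2>` and `<p>` inside `<h1>`, the nodeValue of `<h1>` will include their text as well, which matches the result reported here.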
 [2023-09-07 18:45 UTC] nielsdos@php.net
I contacted the libxml2 developer, and he confirmed it *is* actually a bug and has fixed it: https://gitlab.gnome.org/GNOME/libxml2/-/issues/584
The next release of libxml2 will have the right behaviour, i.e. no nesting inside h1.
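
Since the behaviour depends on the libxml2 that PHP is linked against, code that has to run on both old and new versions could probe for the nesting instead of comparing version numbers. The sketch below is an assumption about how one might do that; `noImpliedNestsSiblings()` is a hypothetical helper name, not part of PHP or libxml2.

```php
<?php
// Hypothetical helper: returns true if this PHP/libxml2 build nests later
// top-level elements inside the first one when LIBXML_HTML_NOIMPLIED is used.
function noImpliedNestsSiblings(): bool
{
    $dom = new DOMDocument();
    $dom->loadHTML('<h1>a</h1><h2>b</h2>', LIBXML_HTML_NOIMPLIED);

    $h2 = $dom->getElementsByTagName('h2')->item(0);

    // With nesting, the parent of <h2> is the <h1> element;
    // without nesting, it is the document node itself.
    return $h2 !== null && $h2->parentNode instanceof DOMElement;
}

// LIBXML_DOTTED_VERSION reports the libxml2 version PHP was built against.
echo 'libxml2 ' . LIBXML_DOTTED_VERSION . ': ';
echo noImpliedNestsSiblings() ? "siblings get nested\n" : "siblings stay at top level\n";
```

Probing the behaviour avoids hard-coding a version number, which this report does not state.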
 [2023-09-07 18:46 UTC] nielsdos@php.net
-Status: Verified
+Status: Closed
-Type: Documentation Problem
+Type: Bug
-Assigned To:
+Assigned To: nielsdos
 [2023-09-07 18:46 UTC] nielsdos@php.net
The fix for this bug has been committed.
If you are still experiencing this bug, try to check out latest source from https://github.com/php/php-src and re-test.
Thank you for the report, and for helping us make PHP better.

Fixed in libxml2.
 
Last updated: Thu Apr 25 12:01:31 2024 UTC