Bug #72288 Lines longer than 1000 characters break DOMDocument::loadHTML()
Submitted: 2016-05-30 09:13 UTC
Modified: 2021-03-12 12:34 UTC
Votes:2
Avg. Score:5.0 ± 0.0
Reproduced:1 of 2 (50.0%)
Same Version:0 (0.0%)
Same OS:1 (100.0%)
From: maciej at klepaczewski dot com
Assigned: cmb
Status: Not a bug
Package: DOM XML related
PHP Version: 5.6.22
OS: Windows, Linux
Private report: No
CVE-ID: None
 [2016-05-30 09:13 UTC] maciej at klepaczewski dot com
Description:
------------
When a single line contains more than 1000 characters (the exact threshold might depend on the EOL format), DOMDocument::loadHTML() emits warnings similar to:

=====
Warning: DOMDocument::loadHTML(): Unexpected end tag : head in Entity, line: 7 in test.php on line 4

Warning: DOMDocument::loadHTML(): htmlParseStartTag: misplaced <body> tag in Entity, line: 8 in test.php on line 4
=====

Test script:
---------------
<?php
error_reporting(E_ALL);
$html = <<<HTML
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="da" lang="da">
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
    <title>Test</title>
<!--placeholder-->
</head>
<body>
</body>
</html>
HTML;
echo "==== < 1000 spaces in head===\n";
$x = new DOMDocument();
$x->loadHTML(str_replace('<!--placeholder-->', str_repeat(' ', 995), $html));
echo "==== 1000 spaces in head===\n";
$y = new DOMDocument();
$y->loadHTML(str_replace('<!--placeholder-->', str_repeat(' ', 1000), $html));

Expected result:
----------------
==== < 1000 spaces in head===
==== 1000 spaces in head===

Actual result:
--------------
==== < 1000 spaces in head===
==== 1000 spaces in head===

Warning: DOMDocument::loadHTML(): Unexpected end tag : head in Entity, line: 7 in test.php on line 20

Warning: DOMDocument::loadHTML(): htmlParseStartTag: misplaced <body> tag in Entity, line: 8 in test.php on line 20

 [2016-05-30 09:28 UTC] maciej at klepaczewski dot com
One thing I forgot to mention is that the <body> tag and its attributes are replaced with an empty <body>. So if the original document has <body class="page">, then after the warning it is replaced with a plain <body> tag.

Improved test script showing this behavior:
-----
<?php
error_reporting(E_ALL);
$html = <<<HTML
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="da" lang="da">
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
    <title>Test</title>
<!--placeholder-->
</head>
<body class="page">
</body>
</html>
HTML;
echo "==== < 1000 spaces in head===\n";
$x = new DOMDocument();
$x->loadHTML(str_replace('<!--placeholder-->', str_repeat(' ', 995), $html));
echo $x->saveHTML();
echo "==== 1000 spaces in head===\n";
$y = new DOMDocument();
$y->loadHTML(str_replace('<!--placeholder-->', str_repeat(' ', 1000), $html));
echo $y->saveHTML();
 [2016-05-31 20:27 UTC] cmb@php.net
-Status: Open +Status: Verified
 [2016-05-31 20:27 UTC] cmb@php.net
I can reproduce this behavior.  Might be a libxml issue.
 [2021-03-12 12:32 UTC] cmb@php.net
-Status: Verified +Status: Not a bug -Assigned To: +Assigned To: cmb
 [2021-03-12 12:32 UTC] cmb@php.net
This problem is neither in PHP nor in its usage of libxml2, but
rather in libxml2 itself.  During parsing, there is a buffer of
1000 chars[1].  If this buffer fills up, apparently libxml2 assumes
CDATA content in the head element (instead of ignoring the
whitespace), which is not allowed[2], and so a <p> is implied, which
breaks the DOM.  You can see that when you inspect $y->saveHTML().

Consider bringing this up with the libxml2 project.

[1] <https://github.com/GNOME/libxml2/blob/v2.9.10/HTMLparser.c#L51>
[2] <https://github.com/GNOME/libxml2/blob/v2.9.10/HTMLparser.c#L1147-L1158>
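
For anyone who wants to check whether their own setup shows the implied <p> described above, here is a minimal sketch. The function name load_html_report() and the shape of its return value are made up for illustration and are not part of PHP or libxml2. It collects the libxml messages instead of emitting PHP warnings and counts <p> elements in the parsed document; since the test documents in this report contain no <p>, any that show up were implied by the parser. The result presumably depends on the libxml2 version PHP is linked against.

```php
<?php
// Hypothetical helper (name and return shape are assumptions for illustration).
// Loads an HTML string, collects libxml messages instead of raising PHP
// warnings, and counts <p> elements found in the resulting DOM.
function load_html_report(string $html): array
{
    // Switch to internal error collection and remember the previous setting.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    // Clean up and restore the previous error handling mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return [
        'messages'    => $messages,
        'p_count'     => $doc->getElementsByTagName('p')->length,
        'html'        => $doc->saveHTML(),
    ];
}
```

Calling load_html_report() once with the 995-space document and once with the 1000-space document from the test script, then comparing 'messages' and 'p_count', should show whether the linked libxml2 is affected.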
 [2021-03-12 12:34 UTC] cmb@php.net
To clarify: this is not about the line length, but rather about
the amount of whitespace in the <head> element.
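
Since the trigger is a long run of whitespace rather than a long line, one possible workaround on the PHP side is to shrink such runs before calling loadHTML(). The sketch below is an assumption, not something proposed in this report: collapse_long_whitespace() is a hypothetical helper, and the default limit of 1000 is taken from the buffer size mentioned above. It works on the raw string, so it would also alter whitespace inside <pre>, <textarea> or <script> content, which may not be acceptable for every document.

```php
<?php
// Hypothetical workaround (not from the report): replace every run of at
// least $limit whitespace characters with a single newline, so that no
// whitespace-only text run reaches the size discussed above.
function collapse_long_whitespace(string $html, int $limit = 1000): string
{
    // PCRE quantifiers are limited to 65535; reject values outside that range.
    if ($limit < 2 || $limit > 65535) {
        throw new InvalidArgumentException('limit must be between 2 and 65535');
    }

    $result = preg_replace('/\s{' . $limit . ',}/', "\n", $html);

    // preg_replace() returns null on failure; fall back to the input unchanged.
    return $result ?? $html;
}
```

Usage would be along the lines of $doc->loadHTML(collapse_long_whitespace($html)); passing a smaller limit leaves some margin below the threshold.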