Bug #72288 Lines longer than 1000 characters break DOMDocument::loadHTML()
Submitted: 2016-05-30 09:13 UTC Modified: 2021-03-12 12:34 UTC
Votes:2
Avg. Score:5.0 ± 0.0
Reproduced:1 of 2 (50.0%)
Same Version:0 (0.0%)
Same OS:1 (100.0%)
From: maciej at klepaczewski dot com Assigned: cmb (profile)
Status: Not a bug Package: DOM XML related
PHP Version: 5.6.22 OS: Windows, Linux
Private report: No CVE-ID: None
 [2016-05-30 09:13 UTC] maciej at klepaczewski dot com
Description:
------------
When a single line contains more than 1000 characters (this might depend on the EOL format), DOMDocument::loadHTML() emits warnings similar to:

=====
Warning: DOMDocument::loadHTML(): Unexpected end tag : head in Entity, line: 7 in test.php on line 4

Warning: DOMDocument::loadHTML(): htmlParseStartTag: misplaced <body> tag in Entity, line: 8 in test.php on line 4
=====

Test script:
---------------
<?php
error_reporting(E_ALL);
$html = <<<HTML
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="da" lang="da">
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
    <title>Test</title>
<!--placeholder-->
</head>
<body>
</body>
</html>
HTML;
echo "==== < 1000 spaces in head===\n";
$x = new DOMDocument();
$x->loadHTML(str_replace('<!--placeholder-->', str_repeat(' ', 995), $html));
echo "==== 1000 spaces in head===\n";
$y = new DOMDocument();
$y->loadHTML(str_replace('<!--placeholder-->', str_repeat(' ', 1000), $html));

Expected result:
----------------
==== < 1000 spaces in head===
==== 1000 spaces in head===

Actual result:
--------------
==== < 1000 spaces in head===
==== 1000 spaces in head===

Warning: DOMDocument::loadHTML(): Unexpected end tag : head in Entity, line: 7 in test.php on line 20

Warning: DOMDocument::loadHTML(): htmlParseStartTag: misplaced <body> tag in Entity, line: 8 in test.php on line 20

 [2016-05-30 09:28 UTC] maciej at klepaczewski dot com
One thing I forgot to mention is that the <body> tag and its attributes are replaced with an empty <body>. So if the original document has <body class="page">, then after the warning it is replaced with a plain <body> tag.

Improved test script showing this behavior:
-----
<?php
error_reporting(E_ALL);
$html = <<<HTML
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="da" lang="da">
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
    <title>Test</title>
<!--placeholder-->
</head>
<body class="page">
</body>
</html>
HTML;
echo "==== < 1000 spaces in head===\n";
$x = new DOMDocument();
$x->loadHTML(str_replace('<!--placeholder-->', str_repeat(' ', 995), $html));
echo $x->saveHTML();
echo "==== 1000 spaces in head===\n";
$y = new DOMDocument();
$y->loadHTML(str_replace('<!--placeholder-->', str_repeat(' ', 1000), $html));
echo $y->saveHTML();
 [2016-05-31 20:27 UTC] cmb@php.net
-Status: Open +Status: Verified
 [2016-05-31 20:27 UTC] cmb@php.net
I can reproduce this behavior.  Might be a libxml issue.
 [2021-03-12 12:32 UTC] cmb@php.net
-Status: Verified +Status: Not a bug -Assigned To: +Assigned To: cmb
 [2021-03-12 12:32 UTC] cmb@php.net
This problem is neither in PHP nor in its usage of libxml2, but
rather in libxml2 itself.  During parsing, there is a buffer of 1000
chars[1].  If this is filled, libxml2 apparently assumes CDATA
content in the head element (instead of ignoring the whitespace),
which is not allowed[2], and so a <p> is implied, which breaks the
DOM.  You can see that when you inspect $y->saveHTML().

Consider bringing this up with the libxml2 project.

[1] <https://github.com/GNOME/libxml2/blob/v2.9.10/HTMLparser.c#L51>
[2] <https://github.com/GNOME/libxml2/blob/v2.9.10/HTMLparser.c#L1147-L1158>
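The effect described above can be checked programmatically rather than by reading warnings off the screen. The following is only a sketch, assuming PHP 7.0 or later with the DOM extension; the helper names buildHtml() and inspectParse() are made up for illustration and are not part of PHP or libxml2. It reuses the markup from the report, collects the libxml warnings, counts implied <p> elements, and reads back the class attribute of <body>. What it prints depends on the libxml2 version PHP is linked against, so no output is shown here.

```php
<?php
// Hypothetical helper: the markup from the report, with $spaces space
// characters placed inside <head>.
function buildHtml(int $spaces): string
{
    $template = <<<HTML
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="da" lang="da">
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
    <title>Test</title>
<!--placeholder-->
</head>
<body class="page">
</body>
</html>
HTML;
    return str_replace('<!--placeholder-->', str_repeat(' ', $spaces), $template);
}

// Hypothetical helper: parse the markup while collecting libxml warnings
// instead of emitting them, then summarize the resulting tree.
function inspectParse(string $html): array
{
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    $warnings = array_map(
        static function (LibXMLError $error): string {
            return trim($error->message);
        },
        libxml_get_errors()
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the caller's setting

    $body = $doc->getElementsByTagName('body')->item(0);

    return [
        'warnings'   => $warnings,
        // A non-zero count means the parser implied a <p> element.
        'paragraphs' => $doc->getElementsByTagName('p')->length,
        // An empty string means the class attribute of <body> was lost.
        'bodyClass'  => $body instanceof DOMElement ? $body->getAttribute('class') : null,
    ];
}

foreach ([995, 1000] as $spaces) {
    echo "==== {$spaces} spaces in head ====\n";
    print_r(inspectParse(buildHtml($spaces)));
}
```

If the 1000-space case reports a paragraph count above zero or an empty bodyClass, that matches the implied <p> and the lost body attributes described in this report.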
 [2021-03-12 12:34 UTC] cmb@php.net
To clarify: this is not about the line length, but rather about
the amount of whitespace in the <head> element.
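One way to test this distinction on a given installation is to compare a very long line that contains no whitespace run with many short lines that consist only of whitespace. This is a rough sketch, assuming PHP 7.0 or later; countParseWarnings() and wrapInHead() are hypothetical helpers, and the minimal HTML5-style document is an assumption that differs from the markup in the report. Results may vary with the libxml2 version.

```php
<?php
// Hypothetical helper: parse $html and return how many libxml
// warnings or errors were recorded during parsing.
function countParseWarnings(string $html): int
{
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    (new DOMDocument())->loadHTML($html);

    $count = count(libxml_get_errors());
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the caller's setting

    return $count;
}

// Hypothetical helper: wrap arbitrary <head> content in a minimal document.
function wrapInHead(string $headContent): string
{
    return "<!DOCTYPE html>\n<html>\n<head>\n<title>Test</title>\n"
        . $headContent
        . "\n</head>\n<body>\n</body>\n</html>";
}

$cases = [
    // One line far longer than 1000 characters, but no long whitespace run.
    'long line, no whitespace run' =>
        '<meta name="x" content="' . str_repeat('a', 2000) . '">',
    // 200 short lines of 10 spaces each: 2200 whitespace characters in total.
    'short lines, whitespace only' => str_repeat("          \n", 200),
];

foreach ($cases as $label => $head) {
    printf("%-32s %d warning(s)\n", $label, countParseWarnings(wrapInHead($head)));
}
```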
 
Last updated: Wed Feb 05 20:01:30 2025 UTC