Bug #72288 Lines longer than 1000 characters break DOMDocument::loadHTML()
Submitted: 2016-05-30 09:13 UTC Modified: 2021-03-12 12:34 UTC
Avg. Score:5.0 ± 0.0
Reproduced:1 of 2 (50.0%)
Same Version:0 (0.0%)
Same OS:1 (100.0%)
From: maciej at klepaczewski dot com Assigned: cmb (profile)
Status: Not a bug Package: DOM XML related
PHP Version: 5.6.22 OS: Windows, Linux
Private report: No CVE-ID: None


 [2016-05-30 09:13 UTC] maciej at klepaczewski dot com
When a single line contains more than 1000 characters (this might depend on the EOL format), DOMDocument::loadHTML() emits warnings similar to:

Warning: DOMDocument::loadHTML(): Unexpected end tag : head in Entity, line: 7 in test.php on line 4

Warning: DOMDocument::loadHTML(): htmlParseStartTag: misplaced <body> tag in Entity, line: 8 in test.php on line 4

Test script:
<?php
// The heredoc is incomplete in this copy of the report. <head>, the placeholder,
// and the closing tags are reconstructed, so warning line numbers may differ.
$html = <<<HTML
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "">
<html xmlns="" xml:lang="da" lang="da">
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<!--placeholder-->
</head>
<body>
</body>
</html>
HTML;
echo "==== < 1000 spaces in head===\n";
$x = new DOMDocument();
$x->loadHTML(str_replace('<!--placeholder-->', str_repeat(' ', 995), $html));
echo "==== 1000 spaces in head===\n";
$y = new DOMDocument();
$y->loadHTML(str_replace('<!--placeholder-->', str_repeat(' ', 1000), $html));

Expected result:
==== < 1000 spaces in head===
==== 1000 spaces in head===

Actual result:
==== < 1000 spaces in head===
==== 1000 spaces in head===

Warning: DOMDocument::loadHTML(): Unexpected end tag : head in Entity, line: 7 in test.php on line 20

Warning: DOMDocument::loadHTML(): htmlParseStartTag: misplaced <body> tag in Entity, line: 8 in test.php on line 20


 [2016-05-30 09:28 UTC] maciej at klepaczewski dot com
One thing I forgot to mention: the <body> tag loses its attributes and is replaced with an empty <body>. For example, if the original document has <body class="page">, then after the warning it becomes a plain <body> tag.

Improved test script showing this behavior:
<?php
// The heredoc is incomplete in this copy of the report. <head>, the placeholder,
// and the closing tags are reconstructed, so warning line numbers may differ.
$html = <<<HTML
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "">
<html xmlns="" xml:lang="da" lang="da">
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<!--placeholder-->
</head>
<body class="page">
</body>
</html>
HTML;
echo "==== < 1000 spaces in head===\n";
$x = new DOMDocument();
$x->loadHTML(str_replace('<!--placeholder-->', str_repeat(' ', 995), $html));
echo $x->saveHTML();
echo "==== 1000 spaces in head===\n";
$y = new DOMDocument();
$y->loadHTML(str_replace('<!--placeholder-->', str_repeat(' ', 1000), $html));
echo $y->saveHTML();
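
The loss of the body attributes can also be checked programmatically instead of reading the saveHTML() output. The following is only a sketch under assumptions: bodyClassAfterParse() and the shortened $template are hypothetical and not part of the report, the code needs PHP 7.1 or later, and whether the 1000-space case actually loses the class depends on the libxml2 version in use.

```php
<?php
// Hypothetical helper (not from the report): parse $html and return the class
// attribute of the resulting <body>, or null if no <body> element exists.
// getAttribute() returns an empty string when the attribute is missing.
function bodyClassAfterParse(string $html): ?string
{
    $doc = new DOMDocument();

    // Collect parser messages internally so only the resulting DOM is examined.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    return $body instanceof DOMElement ? $body->getAttribute('class') : null;
}

// Shortened stand-in for the document used in the report (an assumption).
$template = '<html><head><title>t</title><!--placeholder--></head>'
          . '<body class="page"></body></html>';

foreach ([995, 1000] as $spaces) {
    $class = bodyClassAfterParse(
        str_replace('<!--placeholder-->', str_repeat(' ', $spaces), $template)
    );
    printf("%4d spaces: body class = %s\n", $spaces, var_export($class, true));
}
```

An empty string in the second line of output would correspond to the "vanilla <body>" described above.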
 [2016-05-31 20:27 UTC]
-Status: Open +Status: Verified
 [2016-05-31 20:27 UTC]
I can reproduce this behavior.  It might be a libxml issue.
 [2021-03-12 12:32 UTC]
-Status: Verified +Status: Not a bug -Assigned To: +Assigned To: cmb
 [2021-03-12 12:32 UTC]
This problem is neither in PHP nor in its usage of libxml2, but rather in
libxml2 itself.  During parsing, there is a buffer of 1000
chars[1].  If this buffer fills up, libxml2 apparently assumes CDATA
content in the head element (instead of ignoring the whitespace),
which is not allowed[2], and so a <p> is implied, which breaks the
DOM.  You can see that when you inspect $y->saveHTML().

Consider bringing this up with the libxml2 project.

[1] <>
[2] <>
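
To look at both the parser messages and the serialized DOM in one place, something like the following may help. It is a minimal sketch, not the code used to verify this report: parseAndReport() and the shortened input document are hypothetical, and the messages produced (if any) depend on the libxml2 version PHP is built against.

```php
<?php
// Hypothetical helper: parse $html with libxml's internal error handling
// enabled and return the collected messages plus the serialized DOM.
function parseAndReport(string $html): array
{
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        // Each $error is a LibXMLError with level, code, line and message.
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return ['messages' => $messages, 'html' => $doc->saveHTML()];
}

// Shortened stand-in for the reported document (an assumption).
$input = '<html><head><title>t</title>' . str_repeat(' ', 1000) . '</head>'
       . '<body class="page"></body></html>';

$report = parseAndReport($input);
print_r($report['messages']);
echo $report['html'];
```

If the build is affected, the serialized output should show the implied <p> mentioned above.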
 [2021-03-12 12:34 UTC]
To clarify: this is not about the line length, but rather about
the amount of whitespace in the <head> element.
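
Since the trigger is the amount of whitespace in <head>, one possible userland workaround is to shrink long whitespace runs between tags before calling loadHTML(). This is only a sketch under assumptions: collapseHeadWhitespace() is a hypothetical helper, it uses a regular expression rather than a real HTML parser, it would also touch a ">  <" sequence inside an inline script or style in <head>, and it has not been verified against every libxml2 version.

```php
<?php
// Hypothetical pre-processing step: inside <head>...</head>, replace every
// whitespace-only run of two or more characters between two tags with a
// single newline. Content outside <head> is left untouched.
function collapseHeadWhitespace(string $html): string
{
    $result = preg_replace_callback(
        '~<head\b[^>]*>.*?</head>~is',
        static function (array $match): string {
            $collapsed = preg_replace('~>\s{2,}<~', ">\n<", $match[0]);
            // Keep the original head section if the inner replace failed.
            return $collapsed ?? $match[0];
        },
        $html
    );

    // preg_replace_callback() returns null on a PCRE error; fall back to the input.
    return $result ?? $html;
}

// Shortened stand-in for the reported document (an assumption).
$input = '<html><head><meta charset="utf-8"/>' . str_repeat(' ', 1000) . '</head>'
       . '<body class="page">a  b</body></html>';

$doc = new DOMDocument();
$doc->loadHTML(collapseHeadWhitespace($input));
echo $doc->saveHTML();
```

The design choice here is to touch only whitespace that sits directly between two tags, so text content and attribute values keep their spacing.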