Bug #72288 Lines longer than 1000 characters break DOMDocument::loadHTML()
Submitted: 2016-05-30 09:13 UTC Modified: 2016-05-31 20:27 UTC
Votes:1
Avg. Score:5.0 ± 0.0
Reproduced:0 of 1 (0.0%)
From: maciej at klepaczewski dot com Assigned:
Status: Verified Package: DOM XML related
PHP Version: 5.6.22 OS: Windows, Linux
Private report: No CVE-ID: None
 [2016-05-30 09:13 UTC] maciej at klepaczewski dot com
Description:
------------
When a single line contains more than 1000 characters (the exact threshold might depend on the EOL format), DOMDocument::loadHTML() emits warnings similar to:

=====
Warning: DOMDocument::loadHTML(): Unexpected end tag : head in Entity, line: 7 in test.php on line 4

Warning: DOMDocument::loadHTML(): htmlParseStartTag: misplaced <body> tag in Entity, line: 8 in test.php on line 4
=====

Test script:
---------------
<?php
error_reporting(E_ALL);
$html = <<<HTML
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="da" lang="da">
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
    <title>Test</title>
<!--placeholder-->
</head>
<body>
</body>
</html>
HTML;
echo "==== < 1000 spaces in head===\n";
$x = new DOMDocument();
$x->loadHTML(str_replace('<!--placeholder-->', str_repeat(' ', 995), $html));
echo "==== 1000 spaces in head===\n";
$y = new DOMDocument();
$y->loadHTML(str_replace('<!--placeholder-->', str_repeat(' ', 1000), $html));

Expected result:
----------------
==== < 1000 spaces in head===
==== 1000 spaces in head===

Actual result:
--------------
==== < 1000 spaces in head===
==== 1000 spaces in head===

Warning: DOMDocument::loadHTML(): Unexpected end tag : head in Entity, line: 7 in test.php on line 20

Warning: DOMDocument::loadHTML(): htmlParseStartTag: misplaced <body> tag in Entity, line: 8 in test.php on line 20
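
The warnings above can also be collected instead of printed, which makes it easier to compare what the parser reports for each padding length. This is a minimal sketch, assuming the dom and libxml extensions are loaded. The helper name collectHtmlParseErrors is hypothetical and not part of the original report, and whether any errors come back depends on the libxml version PHP is linked against.

```php
<?php
// Hypothetical helper (not from the original report): parse an HTML string
// and return libxml's errors as an array instead of emitting PHP warnings.
function collectHtmlParseErrors($html)
{
    // Buffer libxml errors internally and remember the previous setting.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    $errors = libxml_get_errors();

    // Restore the state we found so other code is unaffected.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $errors;
}

// Same document as in the test script above.
$html = <<<HTML
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="da" lang="da">
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
    <title>Test</title>
<!--placeholder-->
</head>
<body>
</body>
</html>
HTML;

foreach (array(995, 1000) as $length) {
    $padded = str_replace('<!--placeholder-->', str_repeat(' ', $length), $html);
    $errors = collectHtmlParseErrors($padded);
    printf("%d spaces: %d libxml error(s)\n", $length, count($errors));
    foreach ($errors as $error) {
        // Each entry is a LibXMLError with public line and message properties.
        printf("  line %d: %s\n", $error->line, trim($error->message));
    }
}
```

On an affected build, the 1000-space case should list the same two messages shown under "Actual result".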

 [2016-05-30 09:28 UTC] maciej at klepaczewski dot com
One thing I forgot to mention: the <body> tag and its attributes are replaced with an empty <body>. So if the original document has <body class="page">, then after the warning it is replaced with a plain <body> tag.

Improved test script showing this behavior:
-----
<?php
error_reporting(E_ALL);
$html = <<<HTML
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="da" lang="da">
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
    <title>Test</title>
<!--placeholder-->
</head>
<body class="page">
</body>
</html>
HTML;
echo "==== < 1000 spaces in head===\n";
$x = new DOMDocument();
$x->loadHTML(str_replace('<!--placeholder-->', str_repeat(' ', 995), $html));
echo $x->saveHTML();
echo "==== 1000 spaces in head===\n";
$y = new DOMDocument();
$y->loadHTML(str_replace('<!--placeholder-->', str_repeat(' ', 1000), $html));
echo $y->saveHTML();
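
The attribute loss can also be checked through the DOM itself rather than by reading the saveHTML() output. This is a minimal sketch; the helper name bodyClassAfterParse is hypothetical, the shortened document is an assumption made for brevity, and the value printed for the 1000-space case depends on the libxml version in use.

```php
<?php
// Hypothetical helper (not from the original report): parse $html and return
// the class attribute of the first <body> element, or null if there is none.
function bodyClassAfterParse($html)
{
    // Keep parser warnings out of the output while loading.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);

    // getAttribute() returns an empty string when the attribute is missing.
    return $body === null ? null : $body->getAttribute('class');
}

// Shortened variant of the document above; %s marks the padded line.
$template = "<html>\n<head>\n<title>Test</title>\n%s\n</head>\n"
          . "<body class=\"page\">\n</body>\n</html>";

foreach (array(995, 1000) as $length) {
    $class = bodyClassAfterParse(sprintf($template, str_repeat(' ', $length)));
    printf("%d spaces: body class = %s\n", $length, var_export($class, true));
}
```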
 [2016-05-31 20:27 UTC] cmb@php.net
-Status: Open
+Status: Verified
 [2016-05-31 20:27 UTC] cmb@php.net
I can reproduce this behavior. Might be a libxml issue.
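
Since the cause may lie in libxml rather than in PHP, it can help to note which libxml2 version a build uses when reproducing. A small sketch using constants defined by PHP's libxml extension; they reflect the version PHP was compiled against:

```php
<?php
// Print the PHP version together with the libxml2 version of this build.
// LIBXML_DOTTED_VERSION is a string such as "2.9.1"; LIBXML_VERSION is the
// same version as an integer such as 20901.
printf(
    "PHP %s, libxml2 %s (numeric: %d)\n",
    PHP_VERSION,
    LIBXML_DOTTED_VERSION,
    LIBXML_VERSION
);
```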
 
PHP Copyright © 2001-2020 The PHP Group
All rights reserved.
Last updated: Sat Jan 18 21:01:23 2020 UTC