Bug #80679 if loaded document starts with text, some nodes will be lost
Submitted: 2021-01-28 10:51 UTC
Modified: 2021-01-28 10:55 UTC
From: andrey at email dot dp dot ua
Assigned: cmb
Status: Not a bug
Package: DOM XML related
PHP Version: Irrelevant
OS: Debian Linux
Private report: No
CVE-ID: None
 [2021-01-28 10:51 UTC] andrey at email dot dp dot ua
Description:
------------
If a document starts with text, loading it with DOMDocument::loadHTML() corrupts some element nodes:

1. The text at the beginning of the document is wrapped in <p></p> tags.

2. The wrapped text is moved after the document type declaration.

3. The <html> node from the document is lost, but its child nodes are preserved.

4. The <head> node is lost, but its child nodes are preserved.

5. If the document has no <head> element but has another element in its place, that element is placed inside the <p></p> tags together with the leading text.
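
The points above can be checked programmatically instead of by reading the serialized output. The following is a minimal sketch, not part of the original report: the helper name describeStructure() and the keys of the returned array are hypothetical, and the values it reports for the malformed input depend on the installed libxml version.

```php
<?php
// Hypothetical helper for illustration; not part of the original report.
// It loads the markup and summarises where the key nodes ended up.
function describeStructure(string $html): array
{
    $dom = new DOMDocument();

    // Collect parser warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $title  = $dom->getElementsByTagName('title')->item(0);
    $firstP = $dom->getElementsByTagName('p')->item(0);

    return [
        'has_doctype'  => $dom->doctype !== null,
        'head_count'   => $dom->getElementsByTagName('head')->length,
        'title_parent' => $title !== null ? $title->parentNode->nodeName : null,
        'first_p_text' => $firstP !== null ? trim($firstP->textContent) : null,
    ];
}

// Same shape of input as in the report: text before the DOCTYPE.
$report = describeStructure(
    "some text <!DOCTYPE html>\n<html>\n<head>\n<title>Title of the document</title>\n</head>\n"
    . "<body>\nThe content of the document......\n</body>\n</html>"
);
var_dump($report);
```

On an affected setup, a head_count of 0 and a title_parent other than "head" would correspond to points 3 and 4 above.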


The behaviour differs between libxml versions:


libxml 2.9.1:
The leading text is enclosed in <p></p> tags, and the child nodes of the <head> element are placed before the <body> node.


libxml 2.9.4:
The leading text is enclosed in <p></p> tags, and the child nodes of the <head> element are placed inside the <body> node.
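
Since the result depends on the libxml version, it helps to record that version next to any output being compared. A minimal sketch: LIBXML_DOTTED_VERSION and LIBXML_VERSION are constants provided by PHP's libxml extension, while the helper name libxmlVersionInfo() is hypothetical. As far as I know, these constants reflect the libxml version PHP was built against.

```php
<?php
// Hypothetical helper for illustration; returns the libxml version known to PHP.
function libxmlVersionInfo(): array
{
    return [
        'dotted'  => LIBXML_DOTTED_VERSION, // string, e.g. in the form "2.9.4"
        'numeric' => LIBXML_VERSION,        // integer form of the same version
    ];
}

$info = libxmlVersionInfo();
printf("PHP %s, libxml %s (%d)\n", PHP_VERSION, $info['dotted'], $info['numeric']);
```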

Test script:
---------------
<?php

$html = "some text <!DOCTYPE html>
<html>
<head>
<title>Title of the document</title>
</head>

<body>
The content of the document......
</body>

</html>";


echo "\n\nStarting with text\n";
echo "----------------------\n";

// First case: leading text, and the document has a <head> element.
$dom = new \DOMDocument();
$dom->formatOutput = true;

$dom->loadHTML($html);
echo $dom->saveHTML();


// Second case: rename <head> to <not_head> so the document has no <head> element.
$html = str_replace('head>', 'not_head>', $html);

echo "\n\nStarting with text and no head node\n";
echo "----------------------\n";

$dom = new \DOMDocument();
$dom->formatOutput = true;

$dom->loadHTML($html);
echo $dom->saveHTML();

Expected result:
----------------
Starting with text
----------------------
some text <!DOCTYPE html>
<html>
<head>
<title>Title of the document</title>
</head>

<body>
The content of the document......
</body>

</html>



Starting with text and no head node
----------------------
some text <!DOCTYPE html>
<html>
<not_head>
<title>Title of the document</title>
</not_head>

<body>
The content of the document......
</body>

</html>




Alternative acceptable result:
==============================

Starting with text
----------------------
<!DOCTYPE html>
some text
<html>
<head>
<title>Title of the document</title>
</head>

<body>
The content of the document......
</body>

</html>



Starting with text and no head node
----------------------
<!DOCTYPE html>
some text
<html>
<not_head>
<title>Title of the document</title>
</not_head>

<body>
The content of the document......
</body>

</html>

Actual result:
--------------
Starting with text
----------------------
<!DOCTYPE html>
<html><body>
<p>some text 

</p>
<title>Title of the document</title>
The content of the document......


</body></html>


Starting with text and no head node
----------------------
<!DOCTYPE html>
<html><body>
<p>some text 

<not_head><title>Title of the document</title></not_head></p>
The content of the document......


</body></html>

 [2021-01-28 10:55 UTC] cmb@php.net
-Status: Open
+Status: Not a bug
-Assigned To:
+Assigned To: cmb
 [2021-01-28 10:55 UTC] cmb@php.net
> The behaviour is different for some versions of libxml:

So this is apparently an upstream (i.e. libxml2) issue.  Consider
reporting it there.
 