Bug #80679 if loaded document starts with text, some nodes will be lost
Submitted: 2021-01-28 10:51 UTC
Modified: 2021-01-28 10:55 UTC
From: andrey at email dot dp dot ua
Assigned: cmb
Status: Not a bug
Package: DOM XML related
PHP Version: Irrelevant
OS: Debian Linux
Private report: No
CVE-ID: None
 [2021-01-28 10:51 UTC] andrey at email dot dp dot ua
Description:
------------
If a document starts with text, loading it with loadHTML() corrupts some element nodes:

1. The text at the beginning of the document is wrapped in <p></p> tags.

2. The text inside the <p></p> tags is placed after the document type declaration.

3. The <html> node is lost, but its child nodes are preserved.

4. The <head> node is lost, but its child nodes are preserved.

5. If the document has no <head> element but has another element in its place, that element is placed inside the <p></p> tags together with the leading text.
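
The observations above can be checked programmatically rather than by reading the serialized output. The following is only a minimal sketch: summarizeParsedHtml() is a hypothetical helper name, not part of the report, and the values it returns for the problematic input depend on the installed libxml version, so no output is shown.

```php
<?php
// Hypothetical helper (not from the report): summarizes how loadHTML()
// parsed a document, so the points listed above can be checked in code.
function summarizeParsedHtml(string $html): array
{
    // Collect parser warnings instead of emitting them as PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    $errors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $title = $dom->getElementsByTagName('title')->item(0);

    return [
        'head_count'   => $dom->getElementsByTagName('head')->length,
        'p_count'      => $dom->getElementsByTagName('p')->length,
        // Name of the element that ended up containing <title>, if any.
        'title_parent' => $title !== null ? $title->parentNode->nodeName : null,
        'warnings'     => count($errors),
    ];
}

// Same shape of input as in the report: text before the doctype.
$html = "some text <!DOCTYPE html>\n<html>\n<head>\n"
      . "<title>Title of the document</title>\n</head>\n"
      . "<body>\nThe content of the document......\n</body>\n</html>";

print_r(summarizeParsedHtml($html));
```

A head_count of 0 or a title_parent other than "head" would correspond to points 3 and 4 above.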


The behaviour is different for some versions of libxml:


libxml 2.9.1:
The starting text is enclosed in <p></p> tags, and the child nodes of the <head> element are placed before the <body> node.


libxml 2.9.4:
The starting text is enclosed in <p></p> tags, and the child nodes of the <head> element are placed inside the <body> node.
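
Since the result depends on the libxml version, it may help to record which version PHP is actually linked against when comparing outputs. A minimal sketch, assuming the libxml extension is loaded (it provides the LIBXML_DOTTED_VERSION and LIBXML_VERSION constants); describeLibxml() is a hypothetical helper name:

```php
<?php
// Hypothetical helper: reports the PHP version and the libxml version
// that PHP was compiled against.
function describeLibxml(): string
{
    return sprintf(
        "PHP %s, libxml %s (%d)",
        PHP_VERSION,
        LIBXML_DOTTED_VERSION, // dotted version string of libxml
        LIBXML_VERSION         // the same version as an integer
    );
}

echo describeLibxml(), "\n";
```

Including this line with the output of the test script makes it easier to tell which of the two behaviours described above is being observed.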

Test script:
---------------
<?php

$html = "some text <!DOCTYPE html>
<html>
<head>
<title>Title of the document</title>
</head>

<body>
The content of the document......
</body>

</html>";


echo "\n\nStarting with text\n";
echo "----------------------\n";

$dom = new \DOMDocument();
$dom->formatOutput  = true;

$dom->loadHTML($html);
echo $dom->saveHTML();




// Rename <head> to <not_head> to test a document without a head element.
$html = str_replace('head>', 'not_head>', $html);

echo "\n\nStarting with text and no head node\n";
echo "----------------------\n";

$dom = new \DOMDocument();
$dom->formatOutput  = true;

$dom->loadHTML($html);
echo $dom->saveHTML();

Expected result:
----------------
Starting with text
----------------------
some text <!DOCTYPE html>
<html>
<head>
<title>Title of the document</title>
</head>

<body>
The content of the document......
</body>

</html>



Starting with text and no head node
----------------------
some text <!DOCTYPE html>
<html>
<not_head>
<title>Title of the document</title>
</not_head>

<body>
The content of the document......
</body>

</html>




An alternative acceptable result:
=================================

Starting with text
----------------------
<!DOCTYPE html>
some text
<html>
<head>
<title>Title of the document</title>
</head>

<body>
The content of the document......
</body>

</html>



Starting with text and no head node
----------------------
<!DOCTYPE html>
some text
<html>
<not_head>
<title>Title of the document</title>
</not_head>

<body>
The content of the document......
</body>

</html>

Actual result:
--------------
Starting with text
----------------------
<!DOCTYPE html>
<html><body>
<p>some text 

</p>
<title>Title of the document</title>
The content of the document......


</body></html>


Starting with text and no head node
----------------------
<!DOCTYPE html>
<html><body>
<p>some text 

<not_head><title>Title of the document</title></not_head></p>
The content of the document......


</body></html>

 [2021-01-28 10:55 UTC] cmb@php.net
-Status: Open
+Status: Not a bug
-Assigned To:
+Assigned To: cmb
 [2021-01-28 10:55 UTC] cmb@php.net
> The behaviour is different for some versions of libxml:

So this is apparently an upstream (i.e. libxml2) issue.  Consider
reporting it there.
 