Bug #70934 DOMDocument return wrong tree by tag in script tag
Submitted: 2015-11-18 10:54 UTC Modified: 2018-03-28 16:56 UTC
Votes:3
Avg. Score:4.3 ± 0.9
Reproduced:3 of 3 (100.0%)
Same Version:0 (0.0%)
Same OS:1 (33.3%)
From: sashott at abv dot bg Assigned:
Status: Not a bug Package: DOM XML related
PHP Version: 5.5Git-2015-11-18 (snap) OS: Linux
Private report: No CVE-ID: None

 

 [2015-11-18 10:54 UTC] sashott at abv dot bg
Description:
------------
DOMDocument returns the wrong tree when a script tag contains a tag inside a string variable. For example, inside a DIV, a closing DIV tag written inside a script tag (in a string variable) closes the enclosing DIV.


Test script:
---------------
<?php
$html_content='<!DOCTYPE html>
<html>
	<body>
		<div id="something">
			<script>
			var somevar=\'<div></div>\';
			</script>
			<div></div>
			<div></div>
		</div>
	</body>
</html>
';
$oldSetting=libxml_use_internal_errors(true);
libxml_clear_errors();
$html=new DOMDocument();
$html->loadHtml($html_content);
$node=$html->getElementById('something');
echo "<pre>";
foreach($node->childNodes as $child_node){
	echo $child_node->nodeName."\n";
}
echo "</pre>";
libxml_clear_errors();
libxml_use_internal_errors($oldSetting);
//Result:
//#text
//script
//The error comes from: var somevar=\'<div></div>\';
?>


Expected result:
----------------
script
div
div
//There may be some #text nodes between them.

Actual result:
--------------
#text
script

 [2018-03-28 16:04 UTC] luca dot canella at diennea dot com
Bug confirmed, and I wish to add another caveat. When a script tag has type "text/html", a broken tree becomes much more likely.
I also tried wrapping the content in <![CDATA[ ... ]]>, but this doesn't solve the problem.

Example:

----------- html.html ----------- 

<!doctype html>
<html lang="en">
	<head>
		<meta charset="utf-8">
		<meta http-equiv="X-UA-Compatible" content="IE=edge">
		<title>Titolo</title>
		<meta name="description" content="aaaa">
		<meta name="viewport" content="width=device-width, initial-scale=1">
	</head>
	<body>
		<div class="wrapper">
			<header></header>
			<main>
				<script type="text/html" id="ciao2"><![CDATA[
					<div>DDD</div>
				]]></script>
			</main>
			<footer></footer>
		</div>
		<script type="text/html" id="ciao">
			<div>AAAA</div>
		</script>
		<script>
			console.log(document.getElementById('ciao'), document.getElementById('ciao2'), '<div>ZZZZ</div>');
		</script>
		<div>BBBB</div>
	</body>
</html>

----------- test.php ----------- 

<?php
$html = file_get_contents('html.html');
// loadHTML() is an instance method; calling it statically is deprecated.
$dom = new DOMDocument();
$dom->loadHTML($html);
echo $dom->saveHTML();

--------------------------------
 [2018-03-28 16:28 UTC] requinix@php.net
-Status: Open
+Status: Not a bug
 [2018-03-28 16:28 UTC] requinix@php.net
Actually, this is expected behavior. The HTML parser is (still) only compliant with HTML 4, and under that specification script content ends at the first </, even if it is not part of a </script>.
https://www.w3.org/TR/html4/appendix/notes.html#notes-specifying-data

And HTML doesn't support CDATA sections, so adding one of those won't work.
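
The HTML 4 note linked above recommends escaping "</" inside script content. A minimal sketch of that workaround follows, assuming the script source can be edited before it is parsed. It reuses the markup from the original test script with "</" written as "<\/"; elementChildNames() is a hypothetical helper, not part of the DOM extension.

```php
<?php
// Hypothetical helper (not part of DOMDocument): names of the element children of a node.
function elementChildNames(DOMNode $node): array
{
    $names = [];
    foreach ($node->childNodes as $child) {
        // Skip #text and other non-element nodes.
        if ($child->nodeType === XML_ELEMENT_NODE) {
            $names[] = $child->nodeName;
        }
    }
    return $names;
}

// Same markup as the test script, except that "</" inside the script is escaped as "<\/".
$escaped = <<<'HTML'
<!DOCTYPE html>
<html>
	<body>
		<div id="something">
			<script>
			var somevar='<div><\/div>';
			</script>
			<div></div>
			<div></div>
		</div>
	</body>
</html>
HTML;

$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($escaped);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$names = elementChildNames($doc->getElementById('something'));
echo implode("\n", $names), "\n";
```

With the escape in place, the script content no longer contains a "</" sequence before the real </script>, so the two sibling div elements should stay inside #something. In JavaScript the string '<div><\/div>' still evaluates to '<div></div>'.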
 [2018-03-28 16:49 UTC] spam2 at rhsoft dot net
According to https://validator.w3.org/#validate_by_input+with_options this is 100% clean XHTML markup, from long before HTML5 was even considered:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
  <title>Test</title>
  <script type="text/javascript">
   <![CDATA[
    <div>DDD</div>
   ]]>
  </script>
 </head>
 <body>
  <p>
   Test
  </p>
 </body>
</html>
 [2018-03-28 16:56 UTC] requinix@php.net
I don't see what point you're trying to make by providing markup that was significantly altered to be compatible with a doctype that is not relevant to this bug report.
 [2018-03-28 17:05 UTC] spam2 at rhsoft dot net
That the parser is heavily broken even for standards that are nearly 20 years old:

<?php
$html = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
  <title>Test</title>
  <script type="text/javascript">
   <![CDATA[
    <div>DDD</div>
   ]]>
  </script>
 </head>
 <body>
  <p>
   Test
  </p>
 </body>
</html>';
$doc = new DOMDocument;
$dom = $doc->loadHtml($html);

[harry@srv-rhsoft:/downloads]$ php test.php
Warning: DOMDocument::loadHTML(): Unexpected end tag : div in Entity, line: 8 in /mnt/data/downloads/test.php on line 20
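
For comparison, the XHTML markup above is well-formed XML, so it can also be handed to the XML parser via DOMDocument::loadXML(), which does understand CDATA sections. A rough sketch, assuming the input really is well-formed XHTML; the variable names are illustrative only.

```php
<?php
// The XHTML document from the comment above, unchanged.
$xhtml = <<<'XML'
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
  <title>Test</title>
  <script type="text/javascript">
   <![CDATA[
    <div>DDD</div>
   ]]>
  </script>
 </head>
 <body>
  <p>
   Test
  </p>
 </body>
</html>
XML;

$doc = new DOMDocument();
// loadXML() uses the XML parser; the external DTD is not fetched by default.
$doc->loadXML($xhtml);

// The CDATA content is kept as script text instead of being parsed as markup.
$script = $doc->getElementsByTagName('script')->item(0);
$scriptText = trim($script->textContent);

// No div element should have been created from the CDATA content.
$divCount = $doc->getElementsByTagName('div')->length;

echo $scriptText, "\n";
echo $divCount, "\n";
```

This only applies to documents that are actually well-formed XML; ordinary HTML such as the markup in the original report would be rejected by loadXML().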
 