Bug #70934 DOMDocument return wrong tree by tag in script tag
Submitted: 2015-11-18 10:54 UTC Modified: 2018-03-28 16:56 UTC
Votes:3
Avg. Score:4.3 ± 0.9
Reproduced:3 of 3 (100.0%)
Same Version:0 (0.0%)
Same OS:1 (33.3%)
From: sashott at abv dot bg Assigned:
Status: Not a bug Package: DOM XML related
PHP Version: 5.5Git-2015-11-18 (snap) OS: Linux
Private report: No CVE-ID: None
 [2015-11-18 10:54 UTC] sashott at abv dot bg
Description:
------------
DOMDocument returns the wrong tree when a script tag contains markup inside a string variable. For example, inside a DIV, a closing DIV tag written in a script tag (in a text variable) closes the enclosing DIV.


Test script:
---------------
<?php
$html_content='<!DOCTYPE html>
<html>
	<body>
		<div id="something">
			<script>
			var somevar=\'<div></div>\';
			</script>
			<div></div>
			<div></div>
		</div>
	</body>
</html>
';
$oldSetting=libxml_use_internal_errors(true);
libxml_clear_errors();
$html=new DOMDocument();
$html->loadHtml($html_content);
$node=$html->getElementById('something');
echo "<pre>";
foreach($node->childNodes as $child_node){
	echo $child_node->nodeName."\n";
}
echo "</pre>";
libxml_clear_errors();
libxml_use_internal_errors($oldSetting);
//Result:
//#text
//script
//The error comes from: var somevar=\'<div></div>\';
?>


Expected result:
----------------
script
div
div
//Can have some #text between them.

Actual result:
--------------
#text
script

 [2018-03-28 16:04 UTC] luca dot canella at diennea dot com
Bug confirmed, and I would like to add another caveat. When a script tag has type "text/html", a broken tree becomes far more likely.
I also tried wrapping the content in <![CDATA[ ... ]]>, but this does not solve the problem.

Example:

----------- html.html ----------- 

<!doctype html>
<html lang="en">
	<head>
		<meta charset="utf-8">
		<meta http-equiv="X-UA-Compatible" content="IE=edge">
		<title>Titolo</title>
		<meta name="description" content="aaaa">
		<meta name="viewport" content="width=device-width, initial-scale=1">
	</head>
	<body>
		<div class="wrapper">
			<header></header>
			<main>
				<script type="text/html" id="ciao2"><![CDATA[
					<div>DDD</div>
				]]></script>
			</main>
			<footer></footer>
		</div>
		<script type="text/html" id="ciao">
			<div>AAAA</div>
		</script>
		<script>
			console.log(document.getElementById('ciao'), document.getElementById('ciao2'), '<div>ZZZZ</div>');
		</script>
		<div>BBBB</div>
	</body>
</html>

----------- test.php ----------- 

<?php
$html = file_get_contents('html.html');
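// loadHTML() is an instance method; calling it statically is deprecated in PHP 7 and an error in PHP 8.
$dom = new DOMDocument();
$dom->loadHTML($html);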
echo $dom->saveHTML();

--------------------------------
 [2018-03-28 16:28 UTC] requinix@php.net
-Status: Open +Status: Not a bug
 [2018-03-28 16:28 UTC] requinix@php.net
Actually this is expected behavior. The HTML parser is (still) only compliant with HTML 4, and in HTML 4 script content ends at the first </ followed by a letter, even if it's not a </script>.
https://www.w3.org/TR/html4/appendix/notes.html#notes-specifying-data

And HTML (as opposed to XHTML) doesn't support CDATA sections, so adding one won't work.
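
The HTML 4 note linked above recommends escaping "</" inside script content. A minimal sketch of that workaround follows, assuming the script body is JavaScript and can be rewritten before parsing. escapeScriptCloseTags() is a hypothetical helper, not part of PHP or libxml2, and it is not suitable for type="text/html" templates, whose content it would alter.

```php
<?php
// Hypothetical helper (not a PHP or libxml2 function): escape "</" in inline
// JavaScript so an HTML 4 style parser does not treat it as the end of the script.
function escapeScriptCloseTags(string $js): string
{
    // Inside a JavaScript string literal, "<\/" means the same as "</".
    return str_replace('</', '<\/', $js);
}

$js = "var somevar='<div></div>';";
$html = '<!DOCTYPE html><html><body><div id="something"><script>'
      . escapeScriptCloseTags($js)
      . '</script><div></div><div></div></div></body></html>';

// Collect parser messages internally instead of emitting warnings.
$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// List the children of #something; the intent is that the two sibling
// div elements remain inside it once the script body is escaped.
foreach ($doc->getElementById('something')->childNodes as $child) {
    echo $child->nodeName, "\n";
}
```

The escaping is done on the script body only, before it is embedded in the page, so the real </script> end tag is left untouched.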
 [2018-03-28 16:49 UTC] spam2 at rhsoft dot net
According to https://validator.w3.org/#validate_by_input+with_options, this is 100% clean XHTML markup, from long before HTML5 was even considered:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
  <title>Test</title>
  <script type="text/javascript">
   <![CDATA[
    <div>DDD</div>
   ]]>
  </script>
 </head>
 <body>
  <p>
   Test
  </p>
 </body>
</html>
 [2018-03-28 16:56 UTC] requinix@php.net
I don't see what point you're trying to make by providing markup that was significantly altered to be compatible with a doctype that is not relevant to this bug report.
 [2018-03-28 17:05 UTC] spam2 at rhsoft dot net
That the parser is heavily broken even for standards that are nearly 20 years old:

<?php
$html = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
  <title>Test</title>
  <script type="text/javascript">
   <![CDATA[
    <div>DDD</div>
   ]]>
  </script>
 </head>
 <body>
  <p>
   Test
  </p>
 </body>
</html>';
$doc = new DOMDocument;
$dom = $doc->loadHtml($html);

[harry@srv-rhsoft:/downloads]$ php test.php
Warning: DOMDocument::loadHTML(): Unexpected end tag : div in Entity, line: 8 in /mnt/data/downloads/test.php on line 20
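
For comparison, CDATA sections are an XML feature, so a well-formed XHTML document can be loaded with the XML parser instead of the HTML one. The sketch below assumes the input really is well-formed XML and drops the DOCTYPE from the example above for brevity. It illustrates the difference between the two parsers and is not a fix for loadHTML().

```php
<?php
// Assumption: the markup is well-formed XHTML, so the XML parser can be used.
$xhtml = '<html xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <title>Test</title>
  <script type="text/javascript">
   <![CDATA[
    <div>DDD</div>
   ]]>
  </script>
 </head>
 <body>
  <p>Test</p>
 </body>
</html>';

$doc = new DOMDocument();
$doc->loadXML($xhtml); // XML parser: CDATA sections are understood here

// The CDATA content is kept as text inside the script element,
// so no div element is created from it.
$script = $doc->getElementsByTagName('script')->item(0);
$scriptText = trim($script->textContent);
$divCount = $doc->getElementsByTagName('div')->length;

echo $scriptText, "\n";
echo $divCount, "\n";
```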
 