Bug #74858 DOMDocument loadHTML parses html tags inside cdata
Submitted: 2017-07-05 06:45 UTC
Modified: 2017-07-09 14:16 UTC
From: qdinar at gmail dot com
Assigned:
Status: Not a bug
Package: DOM XML related
PHP Version: 5.6.30
OS: windows 10
Private report: No
CVE-ID: None
Further comment on this bug is unnecessary.

 

 [2017-07-05 06:45 UTC] qdinar at gmail dot com
Description:
------------
A CDATA section used to escape HTML tags inside strings within a script element fails to do so when the markup is fed to PHP's DOMDocument::loadHTML(), which is built on libxml.

You can see in the example that "c='456'; //]]> ", which is part of the script element's content, ends up in the body and would be output to the user.

I reported this to libxml at https://bugzilla.gnome.org/show_bug.cgi?id=784517 but was advised to report it to PHP.

I found a similar bug here, https://bugs.php.net/bug.php?id=71452, but it covers the case without CDATA, so I am filing a separate report.

Test script:
---------------
$test_content='
<script>
//<![CDATA[
a=\'123\';
b=\'</script>\';
c=\'456\';
//]]>
</script>
';
$d=new DOMDocument();
$d->loadHTML($test_content);
echo $d->saveHTML();


Expected result:
----------------
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head>
<script>
//<![CDATA[
a='123';
b='</script>';
c='456';
//]]>
</script>
</head>
<body></body>
</html>


Actual result:
--------------
PHP Warning:  DOMDocument::loadHTML(): Unexpected end tag : script in Entity, line: 8 in C:\xampp\htdocs\test\dom_cdata.php on line 13

Warning: DOMDocument::loadHTML(): Unexpected end tag : script in Entity, line: 8 in C:\xampp\htdocs\test\dom_cdata.php on line 13
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><script>
//<![CDATA[
a='123';
b='</script></head><body><p>';
c='456';
//]]&gt;
</p></body></html>


History

 [2017-07-05 07:16 UTC] requinix@php.net
-Status: Open
+Status: Not a bug
-Package: *General Issues
+Package: DOM XML related
 [2017-07-05 07:16 UTC] requinix@php.net
CDATA sections are not interpreted in HTML (only in XML and XHTML), and HTML parsers are not aware of the JavaScript language, so the </script> inside the string is treated as the end tag for the earlier <script>.

If you want a literal "</script>", escape the slash like so:
  b='<\/script>';
This breaks apart the </ sequence and prevents the rest from being parsed as an end tag.

The other bug report may look similar but it is a different issue.
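
For illustration, the escaping workaround above can be sketched against the markup from the test script. This is a minimal sketch, not part of the original comment; the use of libxml_use_internal_errors() to collect parser messages and the nowdoc string are choices made here. It assumes the parser only ends the script element at a literal "</" sequence, as described above.

```php
<?php
// Same markup as the test script, but with the slash escaped inside the JS string.
$test_content = <<<'HTML'
<script>
//<![CDATA[
a='123';
b='<\/script>';
c='456';
//]]>
</script>
HTML;

// Collect libxml diagnostics instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$d = new DOMDocument();
$d->loadHTML($test_content);

// With "</" broken apart, the script element should hold all three assignments.
$script = $d->getElementsByTagName('script')->item(0);
echo $script->textContent;

// Print whatever the parser complained about, if anything.
foreach (libxml_get_errors() as $error) {
    echo trim($error->message), "\n";
}
libxml_clear_errors();
```

In JavaScript, '<\/script>' evaluates to the same string as '</script>', so the escape changes only what the HTML parser sees.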
 [2017-07-06 09:02 UTC] qdinar at gmail dot com
requinix, loadHTML also parses HTML tags inside ordinary HTML comments: https://bugs.php.net/bug.php?id=74863 .
 [2017-07-09 14:07 UTC] qdinar at gmail dot com
requinix, I tried the same script with this (instead of "echo $d->saveHTML();"):
$es=$d->getElementsByTagName('script');
echo $es->item(0)->childNodes->item(0)->nodeType;
and it output 4, which is:
XML_CDATA_SECTION_NODE (integer) 	4 	Node is a DOMCharacterData
- http://php.net/manual/en/dom.constants.php
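
A slightly fuller version of that check, as a rough sketch: it prints the node's value alongside its type, which shows where the parser considered the script to end. The nowdoc string and the error suppression are additions made here. The node type may differ between libxml versions, so no particular value is assumed.

```php
<?php
// Unescaped markup from the original test script.
$test_content = <<<'HTML'
<script>
//<![CDATA[
a='123';
b='</script>';
c='456';
//]]>
</script>
HTML;

// Keep the "Unexpected end tag" warning out of the output.
libxml_use_internal_errors(true);
$d = new DOMDocument();
$d->loadHTML($test_content);
libxml_clear_errors();

// First child of the script element, i.e. the stored script body.
$node = $d->getElementsByTagName('script')->item(0)->firstChild;

// nodeType reports how the script body was stored; the report above saw 4.
echo $node->nodeType, "\n";

// nodeValue shows the text that was actually kept inside the script element.
var_dump($node->nodeValue);
```

The value still begins with the literal "//<![CDATA[" text and stops before the "</script>" inside the string, so the node type appears to describe how the script body is stored rather than indicating that the CDATA markup was interpreted.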
 [2017-07-09 14:16 UTC] requinix@php.net
-Block user comment: No
+Block user comment: Yes
 [2017-07-09 14:16 UTC] requinix@php.net
http://www.flightlab.com/~joe/sgml/cdata.html dated but still mostly accurate
#5 is what is not supported in HTML 4.

Look. Whether you consider the current behavior to be a bug ("libxml does not parse using current rules") or a request ("libxml should add support for HTML 5"), there is nothing for PHP to do here because this is not a PHP issue. It is a libxml issue. If you want to keep pushing for this then head over to their project.
http://xmlsoft.org/
 
PHP Copyright © 2001-2025 The PHP Group
All rights reserved.
Last updated: Wed Feb 05 20:01:30 2025 UTC