Bug #74858 DOMDocument loadHTML parses html tags inside cdata
Submitted: 2017-07-05 06:45 UTC Modified: 2017-07-09 14:16 UTC
From: qdinar at gmail dot com Assigned:
Status: Not a bug Package: DOM XML related
PHP Version: 5.6.30 OS: windows 10
Private report: No CVE-ID: None
 [2017-07-05 06:45 UTC] qdinar at gmail dot com
Description:
------------
CDATA, which is used to escape HTML tags inside strings in script elements, fails to do that when the markup is fed to PHP's DOMDocument::loadHTML(), which is built on libxml.

You can see in the example that part of the script element's content, "c='456'; //]]> ", ends up being output to the user as page text.

I reported this to libxml at https://bugzilla.gnome.org/show_bug.cgi?id=784517 but was advised to report it to PHP.

I found a similar bug here, https://bugs.php.net/bug.php?id=71452 , but it covers the case without CDATA, so I am filing a separate report.

Test script:
---------------
$test_content='
<script>
//<![CDATA[
a=\'123\';
b=\'</script>\';
c=\'456\';
//]]>
</script>
';
$d=new DOMDocument();
$d->loadHTML($test_content);
echo $d->saveHTML();


Expected result:
----------------
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head>
<script>
//<![CDATA[
a='123';
b='</script>';
c='456';
//]]>
</script>
</head>
<body></body>
</html>


Actual result:
--------------
PHP Warning:  DOMDocument::loadHTML(): Unexpected end tag : script in Entity, line: 8 in C:\xampp\htdocs\test\dom_cdata.php on line 13

Warning: DOMDocument::loadHTML(): Unexpected end tag : script in Entity, line: 8 in C:\xampp\htdocs\test\dom_cdata.php on line 13
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><script>
//<![CDATA[
a='123';
b='</script></head><body><p>';
c='456';
//]]&gt;
</p></body></html>


History

 [2017-07-05 07:16 UTC] requinix@php.net
-Status: Open +Status: Not a bug -Package: *General Issues +Package: DOM XML related
 [2017-07-05 07:16 UTC] requinix@php.net
CDATA sections are not interpreted in HTML (only in XML and XHTML), and HTML parsers are not aware of the JavaScript language, so the </script> in the string is treated as the end tag for the earlier <script>.

If you want a literal "</script>" then escape the slash like
  b='<\/script>';
This breaks apart the </ sequence and prevents the rest from being parsed as an end tag.

The other bug report may look similar but it is a different issue.
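
A minimal sketch of the escaping advice above, assuming a PHP build with the DOM extension. The helper name firstScriptText and the variables $unescaped and $escaped are made up for illustration and do not come from the report. It parses the same script twice, once with a literal "</script>" in the string and once with the slash escaped, and checks whether the c='456' line is still inside the script element.

```php
<?php
// Hypothetical helper: parse an HTML fragment and return the text that
// ended up inside the first <script> element.
function firstScriptText($html)
{
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $scripts = $doc->getElementsByTagName('script');
    return $scripts->length > 0 ? $scripts->item(0)->textContent : '';
}

// Nowdoc strings keep the JavaScript exactly as written, backslash included.
$unescaped = <<<'HTML'
<script>
//<![CDATA[
a='123';
b='</script>';
c='456';
//]]>
</script>
HTML;

$escaped = <<<'HTML'
<script>
//<![CDATA[
a='123';
b='<\/script>';
c='456';
//]]>
</script>
HTML;

// Does the script element still contain the line after the string?
var_dump(strpos(firstScriptText($unescaped), "c='456'") !== false);
var_dump(strpos(firstScriptText($escaped), "c='456'") !== false);
```

If the parser behaves as described in this report, the first check should come out false (the script element ends at the literal "</script>") and the second true.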
 [2017-07-06 09:02 UTC] qdinar at gmail dot com
requinix, loadHTML also parses HTML tags inside ordinary HTML comments: https://bugs.php.net/bug.php?id=74863 .
 [2017-07-09 14:07 UTC] qdinar at gmail dot com
requinix, I tried the same script with this (instead of "echo $d->saveHTML();"):
$es=$d->getElementsByTagName('script');
echo $es->item(0)->childNodes->item(0)->nodeType;
and it output 4, which is:
XML_CDATA_SECTION_NODE (integer) 	4 	Node is a DOMCharacterData
- http://php.net/manual/en/dom.constants.php
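
A self-contained variant of the check above, as a sketch only: it prints the constant name instead of the bare number and confirms that the node is a DOMCharacterData. The $names lookup table is an illustrative addition. The script only shows what the installed libxml reports for the first child of the script element. It does not by itself settle whether the parser recognized the CDATA marker or simply stores raw script content under that node type.

```php
<?php
$html = <<<'HTML'
<script>
//<![CDATA[
a='123';
//]]>
</script>
HTML;

// Collect parser warnings internally instead of printing them.
$previous = libxml_use_internal_errors(true);
$d = new DOMDocument();
$d->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// First child node of the first <script> element.
$node = $d->getElementsByTagName('script')->item(0)->childNodes->item(0);

// Map the numeric nodeType to a readable name for the two types of interest.
$names = array(
    XML_TEXT_NODE          => 'XML_TEXT_NODE',
    XML_CDATA_SECTION_NODE => 'XML_CDATA_SECTION_NODE',
);
$type = $node->nodeType;
echo isset($names[$type]) ? $names[$type] : $type, "\n";

// Both text and CDATA section nodes derive from DOMCharacterData.
var_dump($node instanceof DOMCharacterData);
```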
 [2017-07-09 14:16 UTC] requinix@php.net
-Block user comment: No +Block user comment: Yes
 [2017-07-09 14:16 UTC] requinix@php.net
http://www.flightlab.com/~joe/sgml/cdata.html is dated but still mostly accurate.
#5 there is what is not supported in HTML 4.

Look. Whether you consider the current behavior to be a bug ("libxml does not parse using current rules") or a request ("libxml should add support for HTML 5"), there is nothing for PHP to do here because this is not a PHP issue. It is a libxml issue. If you want to keep pushing for this then head over to their project.
http://xmlsoft.org/
 