Bug #73767 script in document.write (string) in <script> in html get corrupted
Submitted: 2016-12-16 23:38 UTC
Modified: 2016-12-17 00:59 UTC
From: justin dot maxwell at tibit dot com
Assigned:
Status: Not a bug
Package: DOM XML related
PHP Version: 7.0.14
OS: Mint 18
Private report: No
CVE-ID: None
 [2016-12-16 23:38 UTC] justin dot maxwell at tibit dot com
Description:
------------
---
From manual page: http://www.php.net/book.libxml
---
PHP 7.0.8-0ubuntu0.16.04.3 

I have no control of the site with this erm, <expletive> oldskool hacky code in it, but I need to parse it.

See the HTML snippet in the test script for details, but after loadHTML, saveHTML, the string parameter to document.write is missing a piece.

On input, it is: '<scr'+'ipt src="http://example.com/some.js"></scr'+'ipt>'
On output      : '<scr'+'ipt src="http://example.com/some.js">'+'ipt>'

Which of course wreaks havoc with the unclosed injected script tag.

Incidentally,  : '<scr'+'ipt src="http://example.com/some.js"><'+'/scr'+'ipt>'

... on first glance, seems to be parsed without corruption.



Test script:
---------------
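Using the HTML below, saved as test-libxml.html:

$doc = new DOMDocument();
$doc->loadHTMLFile('test-libxml.html');
// saveHTML() only returns the markup, so echo it to see the result
echo $doc->saveHTML();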


<!DOCTYPE html>
<html>
    <head>
        <title>Test libxml</title>
    </head>

    <body>
        <script type="text/javascript">
            document.write('<scr'+'ipt src="http://example.com/some.js"></scr'+'ipt>');
        </script>
    </body>
</html>


Expected result:
----------------
<!DOCTYPE html>
<html>
 <head>
  <title>Test libxml</title>
 </head>
 <body>
  <script type="text/javascript">
   document.write('<scr'+'ipt src="http://example.com/some.js">'+'ipt>');
  </script>
 </body>
</html>


Actual result:
--------------
<!DOCTYPE html>
<html>
 <head>
  <title>Test libxml</title>
 </head>
 <body>
  <script type="text/javascript">
   document.write('<scr'+'ipt src="http://example.com/some.js"></scr'+'ipt>');
  </script>
 </body>
</html>

History

 [2016-12-17 00:59 UTC] yohgaki@php.net
-Status: Open
+Status: Not a bug
 [2016-12-17 00:59 UTC] yohgaki@php.net
Besides, characters in the string must at least be URL-encoded or written as entities, and this kind of simple JavaScript injection must be prevented in the program that generates the JavaScript.

Why do you think the JavaScript code

'<scr'+'ipt src="http://example.com/some.js"></scr'+'ipt>'

must yield

'<scr'+'ipt src="http://example.com/some.js">'+'ipt>'

?

This is a totally wrong thing to do. The contents of a script tag are CDATA, and the string must be returned as it is.

BTW, if your system allows such strings to be generated from user input, there is no reliable way to prevent injection attacks.
 [2016-12-17 14:47 UTC] justin dot maxwell at tibit dot com
You have MISREAD what I wrote, and WRONGLY CLOSED this bug report.

The BUG(!) is that the string is NOT treated as* CDATA and the string is CHANGED.

You say:
 > Why you think JavaScript code  ...  some.js"></scr'+'ipt>'
 > must yield ...  some.js">'+'ipt>'

I DON'T THINK THAT! THAT IS WHAT IT DOES NOW! THAT IS WHAT THE BUG IS!

You would know this if you had tried the trivial sample provided.  I'm entitled to some CAPSLOCK.

This bug makes libxml, and so standard PHP, deficient for use as a simple filtering proxy, as it cannot be relied on to simply load served, VALID (albeit very poorly constructed) HTML, without corrupting character strings on certain (poorly constructed) sites in a way that can massively affect the way that the page is subsequently rendered.

Furthermore:  For what ought to be obvious reasons, you cannot say BOTH: 
 > chars in the string must be encoded, AND
 > Contents inside script tag is CDATA 

*In any event, per the HTML5 spec, section 4.11.1.2
"Restrictions for contents of script elements"

1) This is NOT 'CDATA' ('\' does not escape inside CDATA)

2) There is NO REQUIREMENT to encode characters inside script elements.  

In particular, see the code snippet in the HTML5 spec section referenced above that is preceded by the words "the problem is avoided entirely:" that recommends an approach that includes unencoded < >.

---

An ending word:

I am hugely grateful for all the developers who help maintain and enhance PHP and similar projects.  What your efforts, much of it voluntary, mean, is not lost on me.

But I am annoyed at this response. After hours tracking down this obscure, weird, freaky bug, and then taking the time to file a bug report, including a complete test case requiring three lines of PHP, the responder, rather than just executing that, has basically not bothered to properly read the report, assumed I'm an idiot, told me I'm stupid (in not so many words), and CLOSED the bug report; when it would have taken perhaps thirty seconds to see that the bug is DOING what he is wrongly accusing me of wrongly WANTING.

And now, to bolster this, I've had to dive back into the depths of HTML5 specs, just to make sure that everything I previously understood, and am stating here, is, in fact, correct; and to bolster the case for re-opening this bug.  Which, incidentally, is no longer causing me a problem, because I preg_replace the problematic string to add the additional '+' per the third <scr... line in the report description, which is sufficient to have the served HTML load and save without corruption by libxml.  I'm just wanting to help the next person who hits it.

Again, thanks.  Will someone please re-open so I don't have to open a new report?

Cheers.
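
For anyone else who hits this, the preg_replace workaround mentioned in the comment above might look roughly like the sketch below. The helper name splitScriptCloseTag is hypothetical and the exact pattern is an assumption, since the comment does not show the one actually used. It only handles the single-quoted '</scr'+'ipt>' form from the report, and it must run on the raw HTML before it is passed to loadHTML.

```php
<?php
/**
 * Hypothetical helper, not from the original report.
 * Rewrites  </scr'+'ipt>  as  <'+'/scr'+'ipt>  so that the characters "</"
 * are no longer adjacent inside the JavaScript string literal.
 */
function splitScriptCloseTag(string $html): string
{
    // Regex: a literal "</" followed by scr' + 'ipt> (optional spaces around +).
    // Group 1 captures everything after "</" so it can be re-emitted unchanged.
    return preg_replace("~</(scr'\\s*\\+\\s*'ipt>)~", "<'+'/\$1", $html);
}

// Usage with the line from the report's test page.
$line = "document.write('<scr'+'ipt src=\"http://example.com/some.js\"></scr'+'ipt>');";
echo splitScriptCloseTag($line), "\n";
```

The rewritten string is the same form as the third example in the report description, which was observed there to survive the load/save round trip.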
 [2016-12-17 16:30 UTC] justin dot maxwell at tibit dot com
BETTER EXAMPLE:

This uses JavaScript taken directly from an example at http://api.jquery.com/append/.
Otherwise, it is exactly the same behaviour and the same bug.


Input HTML:
 <!DOCTYPE html>
 <html>
  <head>
   <title>Test libxml</title>
  </head>
  <body>
   <script type="text/javascript">
    $( ".inner" ).append( "<p>Test</p>" );
   </script>
  </body>
 </html>
    

Process:
 $doc = new DOMDocument();
 // loadHTML() expects markup, not a path, so load the file with loadHTMLFile()
 $doc->loadHTMLFile('test-libxml.html');
 echo $doc->saveHTML();


Output: (with added line breaks)
<!DOCTYPE html>
<html>
 <head>
  <title>Test libxml</title>
 </head>
 <body>
  <script type="text/javascript">
   $( ".inner" ).append( "<p>Test" );
  </script>
 </body>
</html>


Note that the closing </p> tag from the source JavaScript string has been stripped out by the DOMDocument HTML Load/Save.
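
To check whether a given PHP/libxml2 build shows this behaviour, a self-contained variant of the example above can be sketched as follows. It feeds the same markup to loadHTML as an inline string, so no external file is needed. The function name roundTripHtml and the "preserved/altered" messages are made up for illustration, and the result is expected to vary with the libxml2 version in use, so no particular output is claimed here.

```php
<?php
// Inline copy of the sample page from the comment above.
$html = <<<'HTML'
<!DOCTYPE html>
<html>
 <head>
  <title>Test libxml</title>
 </head>
 <body>
  <script type="text/javascript">
   $( ".inner" ).append( "<p>Test</p>" );
  </script>
 </body>
</html>
HTML;

/**
 * Hypothetical helper: load and re-serialize the markup, collecting any
 * parser messages instead of letting them surface as PHP warnings.
 * Returns [serialized HTML, list of libxml error messages].
 */
function roundTripHtml(string $html): array
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $output = (string) $doc->saveHTML();

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = 'line ' . $error->line . ': ' . trim($error->message);
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return [$output, $messages];
}

list($output, $messages) = roundTripHtml($html);

// The JavaScript string literal should come back byte for byte.
$needle = '"<p>Test</p>"';
echo strpos($output, $needle) !== false
    ? "script body preserved\n"
    : "script body altered\n";

// Any parser complaints can hint at where the content was dropped.
foreach ($messages as $message) {
    echo $message, "\n";
}
```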
 [2016-12-18 02:03 UTC] justin dot maxwell at tibit dot com
I have just seen that I copy-pasted the actual and expected output into the opposite/wrong boxes in the initial report.  I'm really really sorry, that goes some way to explaining the misunderstanding.  It still would have shown with running the provided test code, or reading of the description, but I apologize for my earlier angry reply.
 
PHP Copyright © 2001-2020 The PHP Group
All rights reserved.
Last updated: Mon Sep 28 12:01:23 2020 UTC