[2016-12-16 23:38 UTC] justin dot maxwell at tibit dot com
Description:
------------
---
From manual page: http://www.php.net/book.libxml
---

PHP 7.0.8-0ubuntu0.16.04.3

I have no control of the site with this erm, <expletive> oldskool hacky code in it, but I need to parse it.

See the HTML snippet in the test script for details, but after loadHTML, saveHTML, the string parameter to document.write is missing a piece.

On input, it is:
'<scr'+'ipt src="http://example.com/some.js"></scr'+'ipt>'

On output, it is:
'<scr'+'ipt src="http://example.com/some.js">'+'ipt>'

Which of course wreaks havoc with the unclosed injected script tag.

Incidentally:
'<scr'+'ipt src="http://example.com/some.js"><'+'/scr'+'ipt>'
... on first glance, seems to be parsed without corruption.

Test script:
---------------
Using HTML as beneath

$doc = new DOMDocument();
$doc->loadHTMLFile('test-libxml.html');
echo $doc->saveHTML();

<!DOCTYPE html>
<html>
<head>
<title>Test libxml</title>
</head>
<body>
<script type="text/javascript">
document.write('<scr'+'ipt src="http://example.com/some.js"></scr'+'ipt>');
</script>
</body>
</html>

Expected result:
----------------
<!DOCTYPE html>
<html>
<head>
<title>Test libxml</title>
</head>
<body>
<script type="text/javascript">
document.write('<scr'+'ipt src="http://example.com/some.js"></scr'+'ipt>');
</script>
</body>
</html>

Actual result:
--------------
<!DOCTYPE html>
<html>
<head>
<title>Test libxml</title>
</head>
<body>
<script type="text/javascript">
document.write('<scr'+'ipt src="http://example.com/some.js">'+'ipt>');
</script>
</body>
</html>
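For anyone wanting to check their own PHP/libxml build, here is a minimal sketch of the same test that does not need a separate HTML file. It is an assumption-laden variant of the report's script, not the reporter's code: the HTML is inlined via a nowdoc, loadHTML is used instead of loadHTMLFile, and the variable names ($needle, $survived) are made up for illustration. It only reports whether the closing fragment survives the round trip; the result may differ between libxml versions.

```php
<?php
// Inlined copy of the HTML from the report (hypothetical variant:
// the original script reads it from test-libxml.html).
$html = <<<'HTML'
<!DOCTYPE html>
<html>
<head>
<title>Test libxml</title>
</head>
<body>
<script type="text/javascript">
document.write('<scr'+'ipt src="http://example.com/some.js"></scr'+'ipt>');
</script>
</body>
</html>
HTML;

// The fragment that the report says goes missing on output.
$needle = "</scr'+'ipt>";

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
$out = $doc->saveHTML();

// True if the JavaScript string came through the round trip unchanged.
$survived = strpos($out, $needle) !== false;
echo $survived ? "string preserved\n" : "string altered\n";
```

If this prints "string altered", the build shows the behaviour described above.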
You have MISREAD what I wrote, and WRONGLY CLOSED this bug report.

The BUG(!) is that the string is NOT treated as* CDATA and the string is CHANGED.

You say:
> Why you think JavaScript code ... some.js"></scr'+'ipt>'
> must yield ... some.js">'+'ipt>'

I DON'T THINK THAT! THAT IS WHAT IT DOES NOW! THAT IS WHAT THE BUG IS! You would know this if you had tried the trivial sample provided.

I'm entitled to some CAPSLOCK.

This bug makes libxml, and so standard PHP, deficient for use as a simple filtering proxy, as it cannot be relied on to simply load served, VALID (albeit very poorly constructed) HTML without corrupting character strings on certain (poorly constructed) sites in a way that can massively affect the way that the page is subsequently rendered.

Furthermore: For what ought to be obvious reasons, you cannot say BOTH:
> chars in the string must be encoded,
AND
> Contents inside script tag is CDATA

*In any event, per the HTML5 spec, section 4.11.1.2 "Restrictions for contents of script elements":
1) This is NOT 'CDATA' ('\' does not escape inside CDATA)
2) There is NO REQUIREMENT to encode characters inside script elements. In particular, see the code snippet in the HTML5 spec section referenced above that is preceded by the words "the problem is avoided entirely:", which recommends an approach that includes unencoded < >.

---
An ending word: I am hugely grateful for all the developers who help maintain and enhance PHP and similar projects. What your efforts, many of them voluntary, mean is not lost on me.

But I am annoyed at this response, because after hours tracking down this obscure, weird, freaky bug, and then taking the time to file a bug report, including a complete test case requiring three lines of PHP, the responder, rather than just executing that, has basically not bothered to properly read the report, assumed I'm an idiot, told me I'm stupid (in not so many words), and CLOSED the bug report; when it would have taken perhaps thirty seconds to see that the bug is DOING what he is wrongly accusing me of wrongly WANTING.

And now, to bolster this, I've had to dive back into the depths of the HTML5 specs, just to make sure that everything I previously understood, and am stating here, is, in fact, correct; and to bolster the case for re-opening this bug.

Which, incidentally, is no longer causing me a problem, because I preg_replace the problematic string to add the additional '+' per the third <scr... line in the report description, which is sufficient to have the served HTML load and save without corruption by libxml. I'm just wanting to help the next person who hits it.

Again, thanks. Will someone please re-open so I don't have to open a new report?

Cheers.
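The preg_replace workaround mentioned above might look roughly like the sketch below. The comment does not give the actual pattern, so the function name split_closing_script and the regular expression are assumptions for illustration. It rewrites only the exact fragment from this report into the third form from the description, which was observed to survive loadHTML/saveHTML.

```php
<?php
// Hypothetical helper: the reporter's real pattern is not shown in the thread.
function split_closing_script(string $html): string
{
    // Rewrite  </scr'+'ipt>  as  <'+'/scr'+'ipt>  so the script body no longer
    // contains "</" directly followed by a letter.
    return preg_replace("~</scr'\\+'ipt>~", "<'+'/scr'+'ipt>", $html);
}

// Example: pre-process the served markup before handing it to DOMDocument.
$line = "document.write('<scr'+'ipt src=\"http://example.com/some.js\"></scr'+'ipt>');";
$fixed = split_closing_script($line);
echo $fixed, "\n";
```

The pattern is deliberately narrow. Pages that split the tag name differently (for example '</sc'+'ript>') would need their own pattern.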