[2015-05-21 11:59 UTC] cmb@php.net
-Status: Open
+Status: Analyzed
[2015-05-21 11:59 UTC] cmb@php.net
[2015-05-21 12:28 UTC] cmb@php.net
-Assigned To:
+Assigned To: cmb
[2015-05-21 12:39 UTC] cmb@php.net
-Assigned To: cmb
+Assigned To:
[2015-06-26 23:25 UTC] cmb@php.net
-Status: Analyzed
+Status: Closed
-Assigned To:
+Assigned To: cmb
[2015-06-26 23:25 UTC] cmb@php.net
[2019-04-03 21:38 UTC] roger21 at free dot fr
[2019-04-03 22:21 UTC] roger21 at free dot fr
[2019-04-16 14:11 UTC] cmb@php.net
Copyright © 2001-2025 The PHP GroupAll rights reserved. |
Last updated: Sun Dec 14 13:00:01 2025 UTC |
Description:
------------
PHP 5.6.8 introduced a regression when loading HTML documents that contain NUL characters (U+0000), such as this one:

$ hexdump -C /tmp/test.html
00000000  3c 21 44 4f 43 54 59 50  45 20 68 74 6d 6c 3e 0a  |<!DOCTYPE html>.|
00000010  3c 68 74 6d 6c 3e 0a 20  20 3c 68 65 61 64 3e 0a  |<html>.  <head>.|
00000020  20 20 20 20 3c 6d 65 74  61 20 63 68 61 72 73 65  |    <meta charse|
00000030  74 3d 22 55 54 46 2d 38  22 3e 0a 20 20 3c 2f 68  |t="UTF-8">.  </h|
00000040  65 61 64 3e 0a 20 20 3c  62 6f 64 79 3e 0a 20 20  |ead>.  <body>.  |
00000050  55 2b 30 30 30 30 20 3c  73 70 61 6e 3e 00 3c 2f  |U+0000 <span>.</|
00000060  73 70 61 6e 3e 0a 20 20  3c 2f 62 6f 64 79 3e 0a  |span>.  </body>.|
00000070  3c 2f 68 74 6d 6c 3e                              |</html>|
00000077

Note the NUL byte (00, at offset 0x5d) inside the "span" element.

-----------------------------------------------------------------

In PHP 5.6.7 it worked as follows:

$ php-5.6.7 -ddisplay_errors=1 -r '$d = new DOMDocument(); $d->loadHTML(file_get_contents("/tmp/test.html")); print("Result: >>>" . $d->saveHTML() . "<<<");'
Result: >>><!DOCTYPE html>
<html><head><meta charset="UTF-8"></head><body>U+0000 <span></span></body></html>
<<<

No parser errors. The document is dumped. The U+0000 character is suppressed by libxml's HTML parser, though.

-----------------------------------------------------------------

The same script executed with a newer PHP (5.6.8 or 5.6.9):

$ php-5.6.9 -ddisplay_errors=1 -r '$d = new DOMDocument(); $d->loadHTML(file_get_contents("/tmp/test.html")); print("Result: >>>" . $d->saveHTML() . "<<<");'

Warning: DOMDocument::loadHTML() expects parameter 1 to be a valid path, string given in Command line code on line 1
Result: >>>
<<<

The HTML content is treated as a path here (instead of a string, as it used to be) and is therefore rejected because it contains a NUL byte. The document is not parsed at all.
The cause seems to be
https://github.com/php/php-src/commit/4435b9142ff9813845d5c97ab29a5d637bedb257#diff-be46ae64a0441a27f014f6404f38a0b9L2171

Here, loadHTML and loadHTMLFile should be treated differently: loadHTMLFile takes a path, whereas loadHTML takes the HTML content itself.

-----------------------------------------------------------------

$ php-5.6.7 -v
PHP 5.6.7 (cli) (built: May 20 2015 13:46:58)
Copyright (c) 1997-2015 The PHP Group
Zend Engine v2.6.0, Copyright (c) 1998-2015 Zend Technologies

$ php-5.6.9 -v
PHP 5.6.9 (cli) (built: May 20 2015 13:26:36)
Copyright (c) 1997-2015 The PHP Group
Zend Engine v2.6.0, Copyright (c) 1998-2015 Zend Technologies

Test script:
---------------
<?php
ini_set('display_errors', 1);
$d = new DOMDocument();
$html = "<!DOCTYPE html><html><head><meta charset='UTF-8'></head><body>U+0000 <span>\x0</span></body></html>";
$d->loadHTML($html);
print($d->saveHTML());

Expected result:
----------------
Behaviour for DOMDocument::loadHTML should be the same as in PHP 5.6.7.
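
Possible workaround:
--------------------
On an affected build, one way to sidestep the "valid path" check might be to strip NUL bytes before calling loadHTML. This is only a sketch, not the fix for the regression: load_html_without_nul() is a hypothetical helper name, and it assumes that discarding U+0000 is acceptable, which seems reasonable given that the 5.6.7 output above did not preserve the character either.

```php
<?php
// Hypothetical helper (not part of PHP or ext/dom): removes NUL bytes so
// that the string is no longer rejected by the path check on affected builds.
function load_html_without_nul($html)
{
    $doc = new DOMDocument();
    // Assumption: dropping U+0000 is acceptable for the caller.
    $clean = str_replace("\0", '', $html);
    $doc->loadHTML($clean);
    return $doc;
}

// Same markup as the test script in this report, with a NUL inside <span>.
$html = "<!DOCTYPE html><html><head><meta charset='UTF-8'></head>"
      . "<body>U+0000 <span>\x00</span></body></html>";

$doc = load_html_without_nul($html);
print($doc->saveHTML());
```

The helper changes the input before parsing, so it is not suitable where the NUL byte has to be detected or reported.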