Bug #69679 DOMDocument::loadHTML refuses to accept NULL bytes
Submitted: 2015-05-21 11:11 UTC Modified: 2019-04-16 14:11 UTC
Votes:3
Avg. Score:4.7 ± 0.5
Reproduced:3 of 3 (100.0%)
Same Version:1 (33.3%)
Same OS:1 (33.3%)
From: joe dot afflerbach+phpnet at sevenval dot com Assigned: cmb (profile)
Status: Closed Package: DOM XML related
PHP Version: 5.6.9 OS: Linux
Private report: No CVE-ID: None
 [2015-05-21 11:11 UTC] joe dot afflerbach+phpnet at sevenval dot com
Description:
------------
PHP 5.6.8 has introduced a regression when loading HTML documents containing NUL characters (U+0000) like this one here:

$ hexdump -C /tmp/test.html
00000000  3c 21 44 4f 43 54 59 50  45 20 68 74 6d 6c 3e 0a  |<!DOCTYPE html>.|
00000010  3c 68 74 6d 6c 3e 0a 20  20 3c 68 65 61 64 3e 0a  |<html>.  <head>.|
00000020  20 20 20 20 3c 6d 65 74  61 20 63 68 61 72 73 65  |    <meta charse|
00000030  74 3d 22 55 54 46 2d 38  22 3e 0a 20 20 3c 2f 68  |t="UTF-8">.  </h|
00000040  65 61 64 3e 0a 20 20 3c  62 6f 64 79 3e 0a 20 20  |ead>.  <body>.  |
00000050  55 2b 30 30 30 30 20 3c  73 70 61 6e 3e 00 3c 2f  |U+0000 <span>.</|
00000060  73 70 61 6e 3e 0a 20 20  3c 2f 62 6f 64 79 3e 0a  |span>.  </body>.|
00000070  3c 2f 68 74 6d 6c 3e                              |</html>|
00000077

Note the NULL byte in the "span" element.

-----------------------------------------------------------------

In PHP 5.6.7 it worked as follows:

$ php-5.6.7 -ddisplay_errors=1 -r '$d = new DOMDocument(); $d->loadHTML(file_get_contents("/tmp/test.html")); print("Result: >>>" . $d->saveHTML() . "<<<");'
Result: >>><!DOCTYPE html>
<html><head><meta charset="UTF-8"></head><body>U+0000 <span></span></body></html>
<<<

No parser errors. The document is dumped. The U+0000 character is suppressed by libxml’s HTML parser, though.

-----------------------------------------------------------------

The same script executed with a newer PHP (5.6.8 or 5.6.9):

$ php-5.6.9 -ddisplay_errors=1 -r '$d = new DOMDocument(); $d->loadHTML(file_get_contents("/tmp/test.html")); print("Result: >>>" . $d->saveHTML() . "<<<");'

Warning: DOMDocument::loadHTML() expects parameter 1 to be a valid path, string given in Command line code on line 1
Result: >>>
<<<

The HTML content is treated as a path here (instead of a string, as it used to be) and is therefore rejected because it contains a NULL byte. The document is not parsed at all.

The cause seems to be https://github.com/php/php-src/commit/4435b9142ff9813845d5c97ab29a5d637bedb257#diff-be46ae64a0441a27f014f6404f38a0b9L2171

Here, loadHTML and loadHTMLFile should be treated differently.
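
Until the two entry points are treated differently, one possible workaround on the affected versions is to strip NUL bytes from the markup before passing it to loadHTML. This is only a sketch: it assumes that dropping U+0000 is acceptable (the 5.6.7 output above shows the character being suppressed anyway), and stripNulBytes is a hypothetical helper name, not part of PHP.

```php
<?php
// Hypothetical helper (not part of PHP): removes every NUL byte so the
// markup no longer trips a path-style argument check.
function stripNulBytes($html)
{
    return str_replace("\0", '', $html);
}

$html = "<!DOCTYPE html><html><head><meta charset='UTF-8'></head>"
      . "<body>U+0000 <span>\x00</span></body></html>";

$clean = stripNulBytes($html);

// Only parse when the DOM extension is available.
if (class_exists('DOMDocument')) {
    $d = new DOMDocument();
    $d->loadHTML($clean);
    print($d->saveHTML());
}
```

No type declarations are used, so the snippet also runs on PHP 5.6.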

-----------------------------------------------------------------

$ php-5.6.7 -v

PHP 5.6.7 (cli) (built: May 20 2015 13:46:58)
Copyright (c) 1997-2015 The PHP Group
Zend Engine v2.6.0, Copyright (c) 1998-2015 Zend Technologies

$ php-5.6.9 -v
PHP 5.6.9 (cli) (built: May 20 2015 13:26:36)
Copyright (c) 1997-2015 The PHP Group
Zend Engine v2.6.0, Copyright (c) 1998-2015 Zend Technologies




Test script:
---------------
<?php

ini_set('display_errors', 1);
$d = new DOMDocument();
$html = "<!DOCTYPE html><html><head><meta charset='UTF-8'></head><body>U+0000 <span>\x0</span></body></html>";
$d->loadHTML($html);
print($d->saveHTML());


Expected result:
----------------
Behaviour for DOMDocument::loadHTML should be the same as in PHP 5.6.7.



History

 [2015-05-21 11:59 UTC] cmb@php.net
-Status: Open +Status: Analyzed
 [2015-05-21 11:59 UTC] cmb@php.net
Bug confirmed: <http://3v4l.org/ZkKkD>.

> Here, loadHTML and loadHTMLFile should be treated differently.

The relevant change appears to be the use of `p` instead of `s` for
ZPP (zend_parse_parameters)[1], which would have to be guarded by the
mode parameter.

[1] <https://github.com/php/php-src/commit/4435b9142ff9813845d5c97ab29a5d637bedb257#diff-be46ae64a0441a27f014f6404f38a0b9L2171>
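
To illustrate the guard described above: the `p` specifier rejects strings with embedded NUL bytes, while `s` accepts any string. The following userland sketch mimics a mode-dependent check. It is not the actual C code from php-src; MODE_STRING, MODE_FILE and acceptsSource are hypothetical names chosen for this example.

```php
<?php
// Hypothetical constants and function, for illustration only.
const MODE_STRING = 0; // markup passed directly, as with loadHTML()
const MODE_FILE   = 1; // a filename, as with loadHTMLFile()

function acceptsSource($source, $mode)
{
    if (!is_string($source)) {
        return false;
    }
    if ($mode === MODE_FILE) {
        // Path-style check (like `p`): embedded NUL bytes are rejected.
        return strpos($source, "\0") === false;
    }
    // String-style check (like `s`): any string is accepted.
    return true;
}

var_dump(acceptsSource("<span>\x00</span>", MODE_STRING)); // bool(true)
var_dump(acceptsSource("/tmp/te\x00st.html", MODE_FILE));  // bool(false)
var_dump(acceptsSource("/tmp/test.html", MODE_FILE));      // bool(true)
```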
 [2015-05-21 12:28 UTC] cmb@php.net
-Assigned To: +Assigned To: cmb
 [2015-05-21 12:39 UTC] cmb@php.net
-Assigned To: cmb +Assigned To:
 [2015-06-26 23:25 UTC] cmb@php.net
-Status: Analyzed +Status: Closed -Assigned To: +Assigned To: cmb
 [2015-06-26 23:25 UTC] cmb@php.net
This bug has already been fixed as of PHP 5.6.10, see <http://3v4l.org/ZkKkD>.
 [2019-04-03 21:38 UTC] roger21 at free dot fr
this problem exists when the null character is in the head:

https://3v4l.org/pTGEr

input: "<!DOCTYPE html><html><head><meta charset='UTF-8'><title>a title with a \x0</title></head><body>U+0000 <span>\x0</span></body></html>"

output: "<!DOCTYPE html> <html><head><meta charset="UTF-8"><title>a title with a</title></head></html>"

instead of expected: "<!DOCTYPE html> <html><head><meta charset="UTF-8"><title>a title with a </title></head><body>U+0000 <span></span></body></html>"

everything after the null character is not parsed

without errors or warnings
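
Since the truncation happens silently, it may help to check the input for NUL bytes before parsing. A minimal sketch, assuming a byte-wise scan of the raw string is sufficient; findNulOffsets is a hypothetical helper, not a PHP or libxml function.

```php
<?php
// Hypothetical helper: returns the byte offset of every NUL byte in $html.
function findNulOffsets($html)
{
    $offsets = array();
    $pos = 0;
    // strpos() returns false once no further NUL byte is found.
    while (($pos = strpos($html, "\0", $pos)) !== false) {
        $offsets[] = $pos;
        $pos++; // continue the search after the match
    }
    return $offsets;
}

$html = "<title>a title with a \x00</title><span>\x00</span>";
$offsets = findNulOffsets($html);

if ($offsets) {
    // Report the offsets so the caller can decide how to handle the input.
    echo "NUL bytes at offsets: " . implode(', ', $offsets) . "\n";
}
```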
 [2019-04-03 22:21 UTC] roger21 at free dot fr
it actually never works?

https://3v4l.org/4plgn

https://3v4l.org/MqaAY

everything after a null character is not parsed
 [2019-04-16 14:11 UTC] cmb@php.net
@roger21, yours is a different issue.  Please open a new ticket.
 
Last updated: Sun Mar 07 07:01:23 2021 UTC