Bug #69679 DOMDocument::loadHTML refuses to accept NULL bytes
Submitted: 2015-05-21 11:11 UTC Modified: 2019-04-16 14:11 UTC
Votes:3
Avg. Score:4.7 ± 0.5
Reproduced:3 of 3 (100.0%)
Same Version:1 (33.3%)
Same OS:1 (33.3%)
From: joe dot afflerbach+phpnet at sevenval dot com Assigned: cmb (profile)
Status: Closed Package: DOM XML related
PHP Version: 5.6.9 OS: Linux
Private report: No CVE-ID: None
 [2015-05-21 11:11 UTC] joe dot afflerbach+phpnet at sevenval dot com
Description:
------------
PHP 5.6.8 has introduced a regression when loading HTML documents containing NUL characters (U+0000) like this one here:

$ hexdump -C /tmp/test.html
00000000  3c 21 44 4f 43 54 59 50  45 20 68 74 6d 6c 3e 0a  |<!DOCTYPE html>.|
00000010  3c 68 74 6d 6c 3e 0a 20  20 3c 68 65 61 64 3e 0a  |<html>.  <head>.|
00000020  20 20 20 20 3c 6d 65 74  61 20 63 68 61 72 73 65  |    <meta charse|
00000030  74 3d 22 55 54 46 2d 38  22 3e 0a 20 20 3c 2f 68  |t="UTF-8">.  </h|
00000040  65 61 64 3e 0a 20 20 3c  62 6f 64 79 3e 0a 20 20  |ead>.  <body>.  |
00000050  55 2b 30 30 30 30 20 3c  73 70 61 6e 3e 00 3c 2f  |U+0000 <span>.</|
00000060  73 70 61 6e 3e 0a 20 20  3c 2f 62 6f 64 79 3e 0a  |span>.  </body>.|
00000070  3c 2f 68 74 6d 6c 3e                              |</html>|
00000077

Note the NUL byte (0x00) inside the "span" element.

-----------------------------------------------------------------

In PHP 5.6.7 it worked as follows:

$ php-5.6.7 -ddisplay_errors=1 -r '$d = new DOMDocument(); $d->loadHTML(file_get_contents("/tmp/test.html")); print("Result: >>>" . $d->saveHTML() . "<<<");'
Result: >>><!DOCTYPE html>
<html><head><meta charset="UTF-8"></head><body>U+0000 <span></span></body></html>
<<<

No parser errors. The document is dumped. The U+0000 character is suppressed by libxml’s HTML parser, though.

-----------------------------------------------------------------

The same script executed with a newer PHP (5.6.8 or 5.6.9):

$ php-5.6.9 -ddisplay_errors=1 -r '$d = new DOMDocument(); $d->loadHTML(file_get_contents("/tmp/test.html")); print("Result: >>>" . $d->saveHTML() . "<<<");'

Warning: DOMDocument::loadHTML() expects parameter 1 to be a valid path, string given in Command line code on line 1
Result: >>>
<<<

The HTML content is treated as a path here (instead of a string, as it used to be) and is therefore rejected because it contains a NUL byte. The document is not parsed at all.

The cause seems to be https://github.com/php/php-src/commit/4435b9142ff9813845d5c97ab29a5d637bedb257#diff-be46ae64a0441a27f014f6404f38a0b9L2171

Here, loadHTML and loadHTMLFile should be treated differently.
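
Until a fixed release is available, one possible workaround is to strip NUL bytes before calling loadHTML(). The sketch below is only an illustration: the helper name loadHtmlWithoutNul is hypothetical (not part of PHP), and it assumes that dropping U+0000 is acceptable, which matches what libxml's HTML parser did with that character in 5.6.7 anyway. It avoids scalar type hints so that it also runs on PHP 5.6.

```php
<?php

/**
 * Hypothetical helper (not part of PHP): removes NUL bytes from the markup
 * and then parses the result with DOMDocument::loadHTML().
 */
function loadHtmlWithoutNul($html)
{
    // Strip every NUL byte so the argument also passes a "valid path" check.
    $clean = str_replace("\0", '', $html);

    $doc = new DOMDocument();
    // Collect libxml parser messages internally instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($clean);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

$html = "<!DOCTYPE html><html><head><meta charset='UTF-8'></head><body>U+0000 <span>\x00</span></body></html>";
$doc = loadHtmlWithoutNul($html);
print($doc->saveHTML());
```

This only sidesteps the argument check; it does not change how loadHTML() itself validates its parameter.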

-----------------------------------------------------------------

$ php-5.6.7 -v

PHP 5.6.7 (cli) (built: May 20 2015 13:46:58)
Copyright (c) 1997-2015 The PHP Group
Zend Engine v2.6.0, Copyright (c) 1998-2015 Zend Technologies

$ php-5.6.9 -v
PHP 5.6.9 (cli) (built: May 20 2015 13:26:36)
Copyright (c) 1997-2015 The PHP Group
Zend Engine v2.6.0, Copyright (c) 1998-2015 Zend Technologies




Test script:
---------------
<?php

ini_set('display_errors', 1);
$d = new DOMDocument();
$html = "<!DOCTYPE html><html><head><meta charset='UTF-8'></head><body>U+0000 <span>\x0</span></body></html>";
$d->loadHTML($html);
print($d->saveHTML());


Expected result:
----------------
Behaviour for DOMDocument::loadHTML should be the same as in PHP 5.6.7.



 [2015-05-21 11:59 UTC] cmb@php.net
-Status: Open +Status: Analyzed
 [2015-05-21 11:59 UTC] cmb@php.net
Bug confirmed: <http://3v4l.org/ZkKkD>.

> Here, loadHTML and loadHTMLFile should be treated differently.

The relevant change appears to be the use of `p` instead of `s` for
ZPP[1], which would have to be guarded by the mode parameter.

[1] <https://github.com/php/php-src/commit/4435b9142ff9813845d5c97ab29a5d637bedb257#diff-be46ae64a0441a27f014f6404f38a0b9L2171>
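
For readers unfamiliar with ZPP specifiers: `s` accepts any binary string, while `p` accepts only strings without embedded NUL bytes, since it is meant for paths. The following is a simplified PHP model of such a mode-dependent check, not the actual C code in php-src; the function name acceptsSource and the mode values 'file' and 'string' are hypothetical stand-ins for loadHTMLFile() and loadHTML() respectively.

```php
<?php

/**
 * Simplified model of the argument check, NOT the real implementation.
 * $mode is a hypothetical stand-in: 'file' models loadHTMLFile(),
 * 'string' models loadHTML().
 */
function acceptsSource($source, $mode)
{
    if ($mode === 'file') {
        // Behaves like `p`: a path must not contain NUL bytes.
        return strpos($source, "\0") === false;
    }

    // Behaves like `s`: any binary string is accepted.
    return true;
}

var_dump(acceptsSource("<span>\0</span>", 'string'));  // bool(true)
var_dump(acceptsSource("/tmp/te\0st.html", 'file'));   // bool(false)
```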
 [2015-05-21 12:28 UTC] cmb@php.net
-Assigned To: +Assigned To: cmb
 [2015-05-21 12:39 UTC] cmb@php.net
-Assigned To: cmb +Assigned To:
 [2015-06-26 23:25 UTC] cmb@php.net
-Status: Analyzed +Status: Closed -Assigned To: +Assigned To: cmb
 [2015-06-26 23:25 UTC] cmb@php.net
This bug has already been fixed as of PHP 5.6.10, see <http://3v4l.org/ZkKkD>.
 [2019-04-03 21:38 UTC] roger21 at free dot fr
this problem exists when the null character is in the head:

https://3v4l.org/pTGEr

input: "<!DOCTYPE html><html><head><meta charset='UTF-8'><title>a title with a \x0</title></head><body>U+0000 <span>\x0</span></body></html>"

output: "<!DOCTYPE html> <html><head><meta charset="UTF-8"><title>a title with a</title></head></html>"

instead of expected: "<!DOCTYPE html> <html><head><meta charset="UTF-8"><title>a title with a </title></head><body>U+0000 <span></span></body></html>"

everything after the null character is not parsed

without errors or warnings
 [2019-04-03 22:21 UTC] roger21 at free dot fr
it actually never works?

https://3v4l.org/4plgn

https://3v4l.org/MqaAY

everything after a null character is not parsed
 [2019-04-16 14:11 UTC] cmb@php.net
@roger21, yours is a different issue.  Please open a new ticket.
 