PHP :: Bug #69679 :: DOMDocument::loadHTML refuses to accept NULL bytes

Submitted:	2015-05-21 11:11 UTC
Modified:	2019-04-16 14:11 UTC
Votes:	3
Avg. Score:	4.7 ± 0.5
Reproduced:	3 of 3 (100.0%)
Same Version:	1 (33.3%)
Same OS:	1 (33.3%)
From:	joe dot afflerbach+phpnet at sevenval dot com
Assigned:	cmb
Status:	Closed
Package:	DOM XML related
PHP Version:	5.6.9
OS:	Linux
Private report:
CVE-ID:	None

[2015-05-21 11:11 UTC] joe dot afflerbach+phpnet at sevenval dot com

Description:
------------
PHP 5.6.8 introduced a regression when loading HTML documents that contain NUL characters (U+0000), such as this one:

$ hexdump -C /tmp/test.html
00000000  3c 21 44 4f 43 54 59 50  45 20 68 74 6d 6c 3e 0a  |<!DOCTYPE html>.|
00000010  3c 68 74 6d 6c 3e 0a 20  20 3c 68 65 61 64 3e 0a  |<html>.  <head>.|
00000020  20 20 20 20 3c 6d 65 74  61 20 63 68 61 72 73 65  |    <meta charse|
00000030  74 3d 22 55 54 46 2d 38  22 3e 0a 20 20 3c 2f 68  |t="UTF-8">.  </h|
00000040  65 61 64 3e 0a 20 20 3c  62 6f 64 79 3e 0a 20 20  |ead>.  <body>.  |
00000050  55 2b 30 30 30 30 20 3c  73 70 61 6e 3e 00 3c 2f  |U+0000 <span>.</|
00000060  73 70 61 6e 3e 0a 20 20  3c 2f 62 6f 64 79 3e 0a  |span>.  </body>.|
00000070  3c 2f 68 74 6d 6c 3e                              |</html>|
00000077

Note the NUL byte (00 at offset 0x5d) inside the "span" element.
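For anyone who wants to recreate the file, the following is a minimal sketch that rebuilds the bytes from the hexdump above. The use of tempnam() and the variable names are assumptions for illustration; the report itself uses /tmp/test.html.

```php
<?php
// Rebuild the document from the hexdump; "\x00" is the NUL byte in <span>.
$html = "<!DOCTYPE html>\n<html>\n  <head>\n    <meta charset=\"UTF-8\">\n"
      . "  </head>\n  <body>\n  U+0000 <span>\x00</span>\n  </body>\n</html>";

// Write to a temporary file (assumed location, not the path from the report).
$path = tempnam(sys_get_temp_dir(), 'nul');
file_put_contents($path, $html);

// The length and NUL offset should match the hexdump (0x77 bytes, NUL at 0x5d).
printf("bytes: 0x%02x, NUL at offset 0x%02x\n", strlen($html), strpos($html, "\x00"));
```

The printed values can be compared against the last offset line and the 00 byte in the hexdump.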

-----------------------------------------------------------------

In PHP 5.6.7 it worked as follows:

$ php-5.6.7 -ddisplay_errors=1 -r '$d = new DOMDocument(); $d->loadHTML(file_get_contents("/tmp/test.html")); print("Result: >>>" . $d->saveHTML() . "<<<");'
Result: >>><!DOCTYPE html>
<html><head><meta charset="UTF-8"></head><body>U+0000 <span></span></body></html>
<<<

No parser errors. The document is dumped. The U+0000 character is suppressed by libxml’s HTML parser, though.

-----------------------------------------------------------------

The same script executed with a newer PHP (5.6.8 or 5.6.9):

$ php-5.6.9 -ddisplay_errors=1 -r '$d = new DOMDocument(); $d->loadHTML(file_get_contents("/tmp/test.html")); print("Result: >>>" . $d->saveHTML() . "<<<");'

Warning: DOMDocument::loadHTML() expects parameter 1 to be a valid path, string given in Command line code on line 1
Result: >>>
<<<

The HTML content is treated as a path here (instead of as a string, as it used to be) and is therefore rejected because it contains a NUL byte. The document is not parsed at all.

The cause seems to be https://github.com/php/php-src/commit/4435b9142ff9813845d5c97ab29a5d637bedb257#diff-be46ae64a0441a27f014f6404f38a0b9L2171

Here, loadHTML and loadHTMLFile should be treated differently.
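To illustrate why the two cases differ, here is a rough userland sketch of the kind of check that a path argument is subject to. The helper name is_acceptable_path() is hypothetical and only mimics the NUL-byte rejection; it is not the engine's actual code.

```php
<?php
// Hypothetical helper: mimics the rule that a path must not contain NUL bytes.
function is_acceptable_path(string $value): bool
{
    return strpos($value, "\0") === false;
}

$markup = "<span>\x00</span>";  // HTML content, as passed to loadHTML()
$file   = "/tmp/test.html";     // file name, as passed to loadHTMLFile()

var_dump(is_acceptable_path($markup)); // bool(false)
var_dump(is_acceptable_path($file));   // bool(true)
```

A check like this makes sense for a file name, but applied to markup it rejects input that the HTML parser previously accepted.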

-----------------------------------------------------------------

$ php-5.6.7 -v

PHP 5.6.7 (cli) (built: May 20 2015 13:46:58)
Copyright (c) 1997-2015 The PHP Group
Zend Engine v2.6.0, Copyright (c) 1998-2015 Zend Technologies

$ php-5.6.9 -v
PHP 5.6.9 (cli) (built: May 20 2015 13:26:36)
Copyright (c) 1997-2015 The PHP Group
Zend Engine v2.6.0, Copyright (c) 1998-2015 Zend Technologies




Test script:
---------------
<?php

ini_set('display_errors', 1);
$d = new DOMDocument();
$html = "<!DOCTYPE html><html><head><meta charset='UTF-8'></head><body>U+0000 <span>\x0</span></body></html>";
$d->loadHTML($html);
print($d->saveHTML());


Expected result:
----------------
Behaviour for DOMDocument::loadHTML should be the same as in PHP 5.6.7.

Pull requests:

Fix bug #69679: DOMDocument::loadHTML refuses to accept NULL bytes (php-src/1296)

History

[2015-05-21 11:59 UTC] cmb@php.net

-Status: Open +Status: Analyzed

[2015-05-21 11:59 UTC] cmb@php.net

Bug confirmed: <http://3v4l.org/ZkKkD>.

> Here, loadHTML and loadHTMLFile should be treated differently.

The relevant change appears to be the use of `p` (path) instead of `s`
(string) for ZPP[1], which would have to be guarded by the mode parameter.

[1] <https://github.com/php/php-src/commit/4435b9142ff9813845d5c97ab29a5d637bedb257#diff-be46ae64a0441a27f014f6404f38a0b9L2171>
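As a rough illustration of what "guarded by the mode parameter" could look like, here is a userland sketch. The constants LOAD_FROM_STRING and LOAD_FROM_FILE and the function check_source() are hypothetical names chosen for this example; the real change would live in the C implementation shared by loadHTML and loadHTMLFile.

```php
<?php
// Hypothetical mode constants, named for illustration only.
define('LOAD_FROM_STRING', 0);
define('LOAD_FROM_FILE', 1);

// Hypothetical check: apply the NUL-byte restriction only to file paths.
function check_source(string $source, int $mode): bool
{
    if ($mode === LOAD_FROM_FILE && strpos($source, "\0") !== false) {
        return false; // a path must not contain NUL bytes
    }
    return true;      // markup is passed through to the parser unchanged
}

var_dump(check_source("<span>\x00</span>", LOAD_FROM_STRING)); // bool(true)
var_dump(check_source("/tmp/te\x00st.html", LOAD_FROM_FILE));  // bool(false)
```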

[2015-05-21 12:28 UTC] cmb@php.net

-Assigned To: +Assigned To: cmb

[2015-05-21 12:39 UTC] cmb@php.net

-Assigned To: cmb +Assigned To:

[2015-06-26 23:25 UTC] cmb@php.net

-Status: Analyzed +Status: Closed -Assigned To: +Assigned To: cmb

[2015-06-26 23:25 UTC] cmb@php.net

This bug has already been fixed as of PHP 5.6.10, see <http://3v4l.org/ZkKkD>.

[2019-04-03 21:38 UTC] roger21 at free dot fr

this problem exists when the null character is in the head:

https://3v4l.org/pTGEr

input: "<!DOCTYPE html><html><head><meta charset='UTF-8'><title>a title with a \x0</title></head><body>U+0000 <span>\x0</span></body></html>"

output: "<!DOCTYPE html> <html><head><meta charset="UTF-8"><title>a title with a</title></head></html>"

instead of expected: "<!DOCTYPE html> <html><head><meta charset="UTF-8"><title>a title with a </title></head><body>U+0000 <span></span></body></html>"

everything after the null character is silently dropped: it is not parsed, and no errors or warnings are raised
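One possible workaround, sketched here under the assumption that dropping NUL bytes is acceptable for the application, is to strip them before handing the markup to loadHTML. The helper strip_nul_bytes() is a hypothetical name, and this does not change how the parser itself treats U+0000.

```php
<?php
// Hypothetical helper: remove NUL bytes so the parser sees the whole document.
function strip_nul_bytes(string $html): string
{
    return str_replace("\0", '', $html);
}

$html = "<!DOCTYPE html><html><head><meta charset='UTF-8'>"
      . "<title>a title with a \x00</title></head>"
      . "<body>U+0000 <span>\x00</span></body></html>";

$clean = strip_nul_bytes($html);

// Parse only if the DOM extension is available in this build.
if (class_exists('DOMDocument')) {
    $d = new DOMDocument();
    $d->loadHTML($clean);
    print($d->saveHTML());
}
```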

[2019-04-03 22:21 UTC] roger21 at free dot fr

it actually never works?

https://3v4l.org/4plgn

https://3v4l.org/MqaAY

everything after a null character is not parsed

[2019-04-16 14:11 UTC] cmb@php.net

@roger21, yours is a different issue.  Please open a new ticket.



Copyright © 2001-2026 The PHP Group All rights reserved.	Last updated: Mon Mar 16 11:00:02 2026 UTC