Doc Bug #55374 DOMDocument::LoadHTMLFile fails with %xx sequences in filename.
Submitted: 2011-08-06 06:37 UTC
Modified: 2021-04-08 14:35 UTC
Votes:11
Avg. Score:4.2 ± 0.8
Reproduced:9 of 10 (90.0%)
Same Version:2 (22.2%)
Same OS:4 (44.4%)
From: keithm at aoeex dot com
Assigned:
Status: Verified
Package: XML related
PHP Version: 5.4.0alpha3
OS: Linux
Private report: No
CVE-ID: None

 [2011-08-06 06:37 UTC] keithm at aoeex dot com
Description:
------------
DOMDocument::LoadHTMLFile appears to urldecode its argument, which causes 
problems when attempting to load a file whose name contains a %xx sequence.

This issue was brought up in ##php on freenode when someone was attempting to load 
a file named 'Linux_Files%2Fetc%2Fbash.bashrc.html'.  The suggested workaround was to 
use loadHTML + file_get_contents instead.
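
A minimal sketch of that workaround, assuming a helper named loadHtmlFromLiteralPath (a hypothetical name, not part of PHP or the DOM extension). It reads the file through the general file functions, which take the name literally, and hands the markup to loadHTML so the filename never reaches the libxml I/O wrapper:

```php
<?php
// Hypothetical helper for illustration; not part of PHP or ext/dom.
function loadHtmlFromLiteralPath(string $path): DOMDocument
{
    // file_get_contents() does not decode %xx sequences in $path.
    $html = file_get_contents($path);
    if ($html === false) {
        throw new RuntimeException("Unable to read file: $path");
    }

    $doc = new DOMDocument();
    // Parse the markup from memory instead of letting libxml open the file.
    $doc->loadHTML($html);

    return $doc;
}
```

Calling loadHtmlFromLiteralPath('Linux_Files%2Fetc%2Fbash.bashrc.html') should then read the file with that literal name. An empty file is not handled here: loadHTML rejects an empty string.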

There was a small debate over whether this is a bug, or just a documentation 
problem (perhaps LoadHTMLFile expects a URL).

DOMDocument::Load() is also affected.

Test script:
---------------
Contents of 'Linux_Files%2Fetc%2Fbash.bashrc.html'

---------------------------------------8<---------------------------------------
<html>
 <head>
  <title></title>
 </head>
 <body>
 </body>
</html>
---------------------------------------8<---------------------------------------


Contents of 'test.php'
---------------------------------------8<---------------------------------------
<?php

$file = 'Linux_Files%2Fetc%2Fbash.bashrc.html';

$doc = new DOMDocument();
$doc->loadHTMLFile($file);
var_dump($doc->getElementsByTagName('body')->length);

echo str_repeat('-', 80), "\r\n";

$doc2 = new DOMDocument();
$doc2->loadHTMLFile(urlencode($file));
var_dump($doc2->getElementsByTagName('body')->length);
---------------------------------------8<---------------------------------------


Expected result:
----------------
Expect ->loadHTMLFile($file) to succeed and ->loadHTMLFile(urlencode($file)) 
to fail with a file-not-found type error.

Actual result:
--------------
->loadHTMLFile($file) fails with errors:

PHP Warning:  DOMDocument::loadHTMLFile(): I/O warning : failed to load external 
entity "Linux_Files%2Fetc%2Fbash.bashrc.html" in /home/kicken/test.php on line 6

Warning: DOMDocument::loadHTMLFile(): I/O warning : failed to load external entity 
"Linux_Files%2Fetc%2Fbash.bashrc.html" in /home/kicken/test.php on line 6


->loadHTMLFile(urlencode($file)) succeeds.


History

 [2013-12-02 16:34 UTC] mike@php.net
-Type: Bug
+Type: Documentation Problem
 [2021-04-08 14:35 UTC] cmb@php.net
-Status: Open
+Status: Verified
-Package: DOM XML related
+Package: XML related
 [2021-04-08 14:35 UTC] cmb@php.net
Hmm, this looks like a bug to me.  At least the behavior is
inconsistent with the general file functions, and it is not
directly related to libxml2.  Actually, the bug fix appears to be
trivial:


 ext/libxml/libxml.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ext/libxml/libxml.c b/ext/libxml/libxml.c
index fc194770e1..68fd0e37c7 100644
--- a/ext/libxml/libxml.c
+++ b/ext/libxml/libxml.c
@@ -305,7 +305,7 @@ static void *php_libxml_streams_IO_open_wrapper(const char *filename, const char
 
 
 	uri = xmlParseURI(filename);
-	if (uri && (uri->scheme == NULL ||
+	if (uri && (
 			(xmlStrncmp(BAD_CAST uri->scheme, BAD_CAST "file", 4) == 0))) {
 		resolved_path = xmlURIUnescapeString(filename, 0, NULL);
 		isescaped = 1;


But given the long-standing behavior, and the BC break, this
shouldn't be fixed for any stable version (and probably only in a
new major version), so yes, the behavior should be documented.

And to be clear: this affects all XML extensions which use
libxml2, not only DOM.
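
To illustrate why the test script behaves as reported, here is a rough sketch of the unescaping step in plain PHP. It assumes rawurldecode() gives the same result as libxml2's xmlURIUnescapeString() for this particular filename; it is not the code path the extension actually runs.

```php
<?php
// Illustration only: rawurldecode() stands in for xmlURIUnescapeString().
$file = 'Linux_Files%2Fetc%2Fbash.bashrc.html';

// A filename with no URI scheme gets unescaped before it is opened,
// so the literal name turns into a path with directory separators.
$opened = rawurldecode($file);
echo $opened, PHP_EOL; // Linux_Files/etc/bash.bashrc.html

// urlencode() escapes each '%' as '%25', so one round of unescaping
// restores the literal name. This matches the second call succeeding.
$escaped = urlencode($file);
echo $escaped, PHP_EOL; // Linux_Files%252Fetc%252Fbash.bashrc.html
var_dump(rawurldecode($escaped) === $file); // bool(true)
```

So a name passed as-is is looked up as Linux_Files/etc/bash.bashrc.html, which does not exist, while the urlencode()d name decodes back to the file that is actually on disk.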
 
Last updated: Mon Oct 18 17:03:34 2021 UTC