php.net |  support |  documentation |  report a bug |  advanced search |  search howto |  statistics |  random bug |  login
Bug #54369 [PATCH] parse_url() incorrectly determines the start of query and fragment parts
Submitted: 2011-03-24 15:46 UTC Modified: 2016-12-20 15:25 UTC
Votes:9
Avg. Score:4.2 ± 1.2
Reproduced:8 of 9 (88.9%)
Same Version:4 (50.0%)
Same OS:2 (25.0%)
From: tomas dot brastavicius at quantum dot lt Assigned:
Status: Duplicate Package: URL related
PHP Version: Irrelevant OS:
Private report: No CVE-ID: None
Anyone can comment on a bug. Have a simpler test case? Does it work for you on a different platform? Let us know!
Just going to say 'Me too!'? Don't clutter the database with that please !
Your email address:
MUST BE VALID
Solve the problem:
29 - 18 = ?
Subscribe to this entry?

 
 [2011-03-24 15:46 UTC] tomas dot brastavicius at quantum dot lt
Description:
------------
Attached patch fixes the issue.

Test script:
---------------
$url = 'http://www.example.com#fra/gment';
echo $url . "\n";
var_dump(parse_url($url));

$url = 'http://www.example.com?p=1/param';
echo $url . "\n";
var_dump(parse_url($url));

// No host, should return false
$url = 'http://#fra/gment';
echo $url . "\n";
var_dump(parse_url($url));

// No host, should return false
$url = 'http://?p=1/param';
echo $url . "\n";
var_dump(parse_url($url));

Expected result:
----------------
http://www.example.com#fra/gment
array(3) {
  ["scheme"]=>
  string(4) "http"
  ["host"]=>
  string(15) "www.example.com"
  ["fragment"]=>
  string(9) "fra/gment"
}
http://www.example.com?p=1/param
array(3) {
  ["scheme"]=>
  string(4) "http"
  ["host"]=>
  string(15) "www.example.com"
  ["query"]=>
  string(9) "p=1/param"
}
http://#fra/gment
bool(false)
http://?p=1/param
bool(false)

Actual result:
--------------
http://www.example.com#fra/gment
array(3) {
  ["scheme"]=>
  string(4) "http"
  ["host"]=>
  string(19) "www.example.com#fra"
  ["path"]=>
  string(6) "/gment"
}
http://www.example.com?p=1/param
array(3) {
  ["scheme"]=>
  string(4) "http"
  ["host"]=>
  string(19) "www.example.com?p=1"
  ["path"]=>
  string(6) "/param"
}
http://#fra/gment
array(3) {
  ["scheme"]=>
  string(4) "http"
  ["host"]=>
  string(4) "#fra"
  ["path"]=>
  string(6) "/gment"
}
http://?p=1/param
array(3) {
  ["scheme"]=>
  string(4) "http"
  ["host"]=>
  string(4) "?p=1"
  ["path"]=>
  string(6) "/param"
}

Patches

bug54369_tests.patch (last revision 2011-03-24 15:30 UTC by tomas dot brastavicius at quantum dot lt)
parse_url.patch (last revision 2011-03-24 14:47 UTC by tomas dot brastavicius at quantum dot lt)

Pull Requests

History

AllCommentsChangesGit/SVN commitsRelated reports
 [2011-04-03 14:10 UTC] tokul at users dot sourceforge dot net
Check url encoding documentation first.
http://en.wikipedia.org/wiki/Percent-encoding

Then fix your $url value. You use reserved character for other purpose.
 [2011-04-03 18:09 UTC] tomas dot brastavicius at quantum dot lt
Another comment about this issue: http://marc.info/?l=php-internals&m=130183032307080&w=2


@Peter
Yes, according to RFC 1738 the test URLs are not valid. But:

1. It is not defined that parse_url() parses URLs according to RFC 1738.

2. parse_url() "is not meant to validate given URL". See http://php.net/manual/en/function.parse-url.php

3. Why it is better to return invalid hostname ("#" and "/" are invalid characters, current parse_url() version) instead of invalid query or fragment (patched parse_url() version) ?


@tokul at users dot sourceforge dot net
Checked


My arguments for the patch acceptance are as follows:

1. parse_url() documentation's "Return Values" section clearly states that query and fragment component starts after "?" and "#" character respectively.

2. I don't know any specification that allows "#" and "?" in the hostnames (someone knows ?) but I know at least RFC3986 (unfortunately I am working with) that allows "/" character in both query and fragment parts. See http://tools.ietf.org/html/rfc3986.html#section-3.4 and http://tools.ietf.org/html/rfc3986.html#section-3.5

3. It has been already stated (although different content) that parse_url() parses URLs according to RFC3986. See http://bugs.php.net/bug.php?id=50484. May be Adam Harvey knows more ?
 [2011-04-03 18:36 UTC] tomas dot brastavicius at quantum dot lt
One more comment about this issue: http://marc.info/?l=php-internals&m=130183094107548&w=2
 [2011-04-03 19:36 UTC] tokul at users dot sourceforge dot net
You can't argue that function is broken and needs fixes, if you feed broken data and expect good output. Use valid urls in your tests, if you want to show that function is broken.
 [2011-05-17 20:12 UTC] tomas dot brastavicius at quantum dot lt
-Summary: parse_url() incorrectly determines the start of query and fragment parts of URL +Summary: [PATCH] parse_url() incorrectly determines the start of query and fragment parts
 [2011-05-17 20:12 UTC] tomas dot brastavicius at quantum dot lt
Changed report name as described in the bug report spec.
 [2011-06-29 21:37 UTC] lenzai2004-dev at yahoo dot com
The point is not about wether the patch is relevant or not.

But for this bug and other cases, parse_url is returning corrupt result.
It could be fixed in 2 ways:
- patch it 
- or detect invalid url and return error.

I've been trying to use this function and after significant volume of URLs I always find cases where it returns incorrect data.
I had to rewrite everything in PHP and it's quite slow.
 [2013-05-31 13:09 UTC] woody dot gilk at gmail dot com
According to RFC, the URL http://www.example.com?foo=bar is a completely valid 
URL. To quote:

> For example, the URI <mailto:fred@example.com> has a path of "fred@example.com",
  whereas the URI <foo://info.example.com?fred> has an empty path.

There is nothing in the RFC spec that says a path must be included in the URL. 
Please fix this bug.
 [2015-06-05 14:43 UTC] cmb@php.net
I'm not sure whether the issue qualifies as *bug* in PHP. The
documentation contains the following note[1]:

> This function is intended specifically for the purpose of
> parsing URLs and not URIs.

Therefore RFC 1738 is relevant, not RFC 3986. However, RFC 1738 is
obsolete...

Anyhow, I had a look at the tests patch, and quite obviously the
main patch enforces some behavioral changes. That might be
considered a BC break for PHP 5, so perhaps it's best to treat
this ticket as feature request, and to improve parse_url() for PHP
7 only.

As the behavioral changes don't appear to cause profound BC
breaks, it seems to me that these changes don't require an RFC. A
PR[2] might be helpful to get more attention to this issue,
though. Are you willing to make a PR, Tomas?

[1] <http://php.net/manual/en/function.parse-url.php#refsect1-function.parse-url-notes>
[2] <https://github.com/php/php-src/pulls>
 [2015-06-05 14:45 UTC] cmb@php.net
Oh, I forgot: <http://3v4l.org/PKG0q>.
 [2016-08-08 21:28 UTC] acm at tweakers dot net
Apparently, this hasn't been fixed in 7.0

The reference to (deprecated) RFC1738, is actually interesting. In that RFC, a '?' or '#' is not a valid part of the 'hostname' (only alphanum, '.' and '-' are valid). So regardless of which of the two are supported, the host should not contain those (i.e. the url is invalid according to RFC1738)

RFC 1738 seems to require a non-empty (i.e. '/') path if there is also a 'search'. And it has no support for fragments (so why is that in parse_url? ;) )

Anyway, I'd consider this report to be both valid against RFC1738 and RFC3986.

I'm not sure why you'd think users of parse_url would expect the reported outcome - that is simply not a valid hostname (not in RFC1738 nor in RFC3986) - rather than either a false (invalid url) or host+query or host+fragment.
 [2016-08-21 11:42 UTC] cmb@php.net
> I'm not sure why you'd think users of parse_url would expect the
> reported outcome […]

It's not (only) about the given cases, but rather a general matter
of BC break. Applying the given patch against current PHP-5.6
lets 9 out of 42 test cases fail.

It would be necessary to investigate on that further, and perhaps
the patch would need to be updated/corrected. A PR would be
welcome!
 [2016-08-21 15:54 UTC] acm at tweakers dot net
> It's not (only) about the given cases, but rather a general matter
> of BC break. Applying the given patch against current PHP-5.6
> lets 9 out of 42 test cases fail.
I was reading your comment as arguing against the reported issue, rather than the provided patch ;)

> It would be necessary to investigate on that further, and perhaps
> the patch would need to be updated/corrected. A PR would be
> welcome!
If the patch doesn't just fix the issue, but breaks on valid url's, than it should obviously be fixed or replaced by another patch.

Unfortunately, my C-knowledge is limited. The php_url_parse_ex-function is therefore way to complicated for me to fully understand, let alone fix it..
 [2016-12-20 15:11 UTC] tomas dot brastavicius at quantum dot lt
This bug has been fixed in https://github.com/php/php-src/commit/b061fa909de77085d3822a89ab901b934d0362c4

The patch has been applied in >=5.6.28, >=7.0.13 and >=7.1.0 versions

One can check this on https://3v4l.org/
 [2016-12-20 15:25 UTC] requinix@php.net
-Status: Open +Status: Duplicate
 [2016-12-20 15:25 UTC] requinix@php.net
Fixed with bug #73192 in 5.6.28/7.0.13/7.1
 
PHP Copyright © 2001-2024 The PHP Group
All rights reserved.
Last updated: Thu Oct 31 22:01:27 2024 UTC