Request #52923 parse_url corrupts some UTF-8 strings
Submitted: 2010-09-25 11:09 UTC Modified: 2020-01-19 09:07 UTC
Votes:40
Avg. Score:4.1 ± 1.1
Reproduced:32 of 33 (97.0%)
Same Version:12 (37.5%)
Same OS:10 (31.2%)
From: masteram at gmail dot com Assigned:
Status: Open Package: URL related
PHP Version: 5.3.3 OS: MS Windows XP
Private report: No CVE-ID: None
 [2010-09-25 11:09 UTC] masteram at gmail dot com
Description:
------------
I have tested this with PHP 5.2.9 and 5.3.3.
Some UTF-8 strings are not being processed correctly by parse_url.
In the given example, strings that contain the chars 'ם' or 'א' come back corrupted, whereas the string 'מישהו' (which does not contain those chars) is processed correctly.
The affected characters (in UTF-8) consist of the following bytes:
ם - d7|9d
א - d7|90

Each of them is converted to the byte pair d7|5f, which is not a valid UTF-8 character.

In addition to ruining the URL, this byte pair is not safe with preg_replace.
Therefore, if we merge the result of parse_url back into a string and then try to replace, say, spaces with underscores using preg_replace, we get NULL, which prints as an empty string.
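
The preg_replace failure can be checked in isolation. The following is a minimal sketch, not part of the original report, that builds the corrupted byte pair d7|5f by hand instead of relying on parse_url; the variable names are illustrative. With the /u modifier preg_replace reports the failure by returning NULL, which prints as an empty string.

```php
<?php
// The byte pair from the report: 0xD7 starts a two-byte UTF-8 sequence,
// but 0x5F ('_') is not a valid continuation byte.
$corrupt = "/he/\xD7\x5F/ByYear.html";

// With the /u modifier, PCRE validates the subject as UTF-8 first.
$result = preg_replace('/\s+/u', '_', $corrupt);
$error  = preg_last_error();

var_dump($result);                         // NULL
var_dump($error === PREG_BAD_UTF8_ERROR);  // bool(true)

// Without /u the subject is treated as raw bytes and passes through.
$bytes = preg_replace('/\s+/', '_', $corrupt);
var_dump($bytes === $corrupt);             // bool(true)
```

Checking preg_last_error() after a NULL result is one way to tell this case apart from other failures.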

I believe that this is similar to bug #26391.

Test script:
---------------
<?php
$url = 'http://www.mysite.org/he/פרויקטים/ByYear.html';
$url = parse_url($url); // $url['path'] is now corrupt

// preg_replace() fails on the invalid UTF-8 and returns NULL
$url = preg_replace('/\s+/u', '_', $url['path']);

Expected result:
----------------
The correct portion of the url.

Actual result:
--------------
Corrupt string (or blank after using preg_replace).

 [2010-09-25 11:42 UTC] rasmus@php.net
-Type: Bug +Type: Feature/Change Request
 [2010-09-25 11:42 UTC] rasmus@php.net
Reclassifying as a feature request.  parse_url has never been multibyte-aware.
 [2010-09-25 12:15 UTC] pajoye@php.net
What about a parse_url_utf8, like what we have for IDN? It could be easy to implement using either native OS APIs (when available) or external libraries (there are a couple of good ones out there).
 [2010-09-25 14:19 UTC] cataphract@php.net
I'd say this request/bug is bogus because such a URL is not valid according to RFC 3986. He should first percent-encode every character that is not allowed to appear unencoded (perhaps after doing some Unicode normalization) and only then parse the URL.
 [2010-09-25 14:34 UTC] pajoye@php.net
It is not a bogus request. The idea would also be to get the decoded (to UTF-8) URL elements as the result. It is also a good complement to IDN support.
 [2010-09-25 16:22 UTC] masteram at gmail dot com
I tend to agree with Pajoye.
Although RFC 3986 calls for the use of percent-encoding for URLs, I believe that it also mentions the IDN format (and the way things look today, there is a host of websites that use UTF-8 encoding, which benefits the readability of internationalized URLs).
I admit not being an expert in URL encoding, but it seems to me that corrupting a string, even if it does not meet the current standards, is bad behavior.
In addition, UTF-8 encoded URLs seem to be quite common in reality. Take the international versions of Wikipedia as an example.
If I'm wrong about that, I would be more than happy to know it.

I am not sure that the encode-analyze-merge-decode procedure is really the best choice. Perhaps the streamlined alternative should be considered. It sure wouldn't hurt.
I, for one, am currently using 'ASCII-only' URLs.
 [2010-09-26 09:46 UTC] cataphract@php.net
The problem is that nothing guarantees that a percent-encoded URL should be interpreted as containing UTF-8 data, or that an (invalid) URL containing unencoded non-ASCII characters should be converted to UTF-8 before being percent-encoded.

In fact, while most browsers will use UTF-8 to build URLs entered in the address bar, in case of HTML anchors in HTML pages, they will prefer to use the encoding of the page instead if it's also an ASCII superset.
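
The ambiguity can be seen with a single character. This is a minimal sketch, not taken from the thread: it percent-encodes "é" once as UTF-8 bytes and once as ISO-8859-1 bytes, using byte escapes so the result does not depend on how the source file is saved. The variable names are illustrative.

```php
<?php
// "é" encoded two ways, written as raw bytes.
$utf8   = "\xC3\xA9"; // é in UTF-8
$latin1 = "\xE9";     // é in ISO-8859-1

// rawurlencode() works on bytes and knows nothing about character sets.
$fromUtf8   = rawurlencode($utf8);
$fromLatin1 = rawurlencode($latin1);

echo $fromUtf8, "\n";   // %C3%A9
echo $fromLatin1, "\n"; // %E9

// Decoding restores the original bytes, not a known encoding.
var_dump(rawurldecode($fromLatin1) === $latin1); // bool(true)
```

Both results are equally valid percent-encoded strings, so a parser that receives "%E9" has no way to know from the URL alone which character set was meant.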


That said, the corruption you describe seems uncalled for. In fact, I am unable to reproduce it. This is the value of $url I get in the end:

string(32) "/he/פרויקטים/ByYear.html"
 [2010-12-01 15:25 UTC] jani@php.net
-Package: *URL Functions +Package: URL related
 [2010-12-08 22:15 UTC] dextercowley at gmail dot com
This issue seems to be platform dependent. For example, on Windows Vista with PHP 5.3.1, parse_url('http://mydomain.com/path/道') returns $array['path'] = "/path/". However, on a Mac, it works correctly and returns "/path/道".

We can work around it by URL-encoding each part of the URL and then decoding the various legal URL characters ("/", ":", "&", and so on) before running parse_url, then decoding the path. However, a parse_url_utf8 function would be very convenient and probably faster. Thanks.
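
A variation on this workaround can be sketched as follows. It is an illustration under stated assumptions, not the commenter's code: instead of encoding everything and decoding the legal characters again, it percent-encodes only the bytes 0x80-0xFF, so parse_url sees an ASCII-only string. The function name parse_url_bytes is hypothetical and not a PHP built-in.

```php
<?php
/**
 * Hypothetical helper (not a PHP built-in): percent-encode every byte
 * outside ASCII, parse the now ASCII-only URL, then decode each string
 * component again. Returns false when parse_url() rejects the URL.
 */
function parse_url_bytes(string $url)
{
    // No /u modifier: the pattern matches single bytes, not characters.
    $encoded = preg_replace_callback(
        '/[\x80-\xFF]/',
        function (array $m): string {
            return rawurlencode($m[0]);
        },
        $url
    );

    $parts = parse_url($encoded);
    if ($parts === false) {
        return false;
    }

    foreach ($parts as $key => $value) {
        // The port is an int; every other component is a string.
        if (is_string($value)) {
            $parts[$key] = rawurldecode($value);
        }
    }
    return $parts;
}

$parts = parse_url_bytes('http://www.mysite.org/he/פרויקטים/ByYear.html');
var_dump($parts['path'] === '/he/פרויקטים/ByYear.html');
```

One caveat with this sketch: rawurldecode also decodes percent-escapes that were already present in the input (for example "%2F"), so it is not a lossless round trip for URLs that mix encoded and unencoded characters.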
 [2012-10-11 20:51 UTC] bugsphpnet at lumental dot com
On our Debian 4.3.2-1.1 server, changing the locale from LANG=en_US to LANG=en_US.UTF-8 seems to have fixed this problem.

In my opinion, parse_url() should treat all extended characters (octets 80-FF) as opaque characters and copy them as-is without modification. Then, the function will work fine for both utf-8 and iso-8859-1 strings. The behaviour of parse_url() should not depend on the LANG setting. In my opinion, this function is buggy.
 [2014-01-28 21:42 UTC] derkontrollfreak+9hy5l at gmail dot com
If you use the default "C" locale you should be fine, too.
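
The locale suggestion can be sketched as a small wrapper. This is a hedged illustration, not something posted in the thread: the function name parse_url_c_locale is hypothetical, and the thread does not establish that switching to the "C" locale avoids the corruption on every platform.

```php
<?php
/**
 * Hypothetical wrapper: run parse_url() under the "C" LC_CTYPE locale
 * and restore the previous locale afterwards.
 */
function parse_url_c_locale(string $url)
{
    // Passing "0" queries the current setting without changing it.
    $previous = setlocale(LC_CTYPE, '0');
    setlocale(LC_CTYPE, 'C');
    try {
        return parse_url($url);
    } finally {
        if ($previous !== false) {
            setlocale(LC_CTYPE, $previous);
        }
    }
}

$parts = parse_url_c_locale('http://mydomain.com/path/道');
// Dump the path as hex so any replaced bytes are visible.
echo bin2hex($parts['path']), "\n";
```

Note that setlocale changes process-wide state, so in a threaded server this wrapper could affect other requests while it runs.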
 [2016-01-15 07:57 UTC] simonsimcity at gmail dot com
This seems to be a quite old bug but still valid. I also stumbled on it and went a bit further:

It works on Ubuntu Linux. My locale is set to "en_US.UTF-8", so bugsphpnet@lumental.com's comment could be the reason for Linux installations.

It does NOT work on OS X, contrary to what dextercowley@gmail.com said. I tried it with several PHP versions (up to 7.0.2) and with every locale setting I could find that differed from my Linux server settings.

Related conversations: https://github.com/symfony/symfony/issues/16776, A post on the general mailinglist: http://news.php.net/php.general/325346 (wasn't able to find a URL that shows the related responses ...)

I guess this one is also related to #68296, which is about the handling of newlines in parse_url().
 [2016-01-15 08:10 UTC] simonsimcity at gmail dot com
Related to #68296
 [2016-03-08 00:08 UTC] me at evertpot dot com
Chiming in with "me too".

This snippet works correctly on linux:

https://3v4l.org/OSlSY

On a Mac the output gets corrupted. Output:

%2F%E6__%E8%AF_%E6%B3_%E5_%AB%E5__.zh

I can definitely see the reasoning behind parse_url only supporting valid URLs. However, IF that's the intended behavior, it should be consistent across platforms and fail instead of corrupting the input.

My use case for parse_url is actually to correct (and normalize) these URLs, but I was assuming that unknown octets are just passed through. This is true for some octets, but not all.
 [2020-01-19 09:07 UTC] cmb@php.net
The internal implementation php_url_parse_ex()[1] uses a mix of
ctype functions (such as isalpha()) and hard-coded character
values, which looks wrong to me.

[1] <https://github.com/php/php-src/blob/php-7.3.13/ext/standard/url.c#L94-L321>
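
To illustrate how a locale-dependent byte classifier could produce the d7|5f byte pair from the original report, here is a hypothetical model in PHP. It is not PHP's actual C code. It assumes that the parser substitutes '_' (0x5F) for any byte the C library flags, and that the flagged set varies with platform and locale; the thread does not confirm which ctype function is responsible. The function name and the flagged values are illustrative.

```php
<?php
/**
 * Hypothetical model, not PHP's actual code: replace every byte that a
 * locale-dependent classifier flags with '_' (0x5F).
 *
 * $flagged lists the byte values the classifier is assumed to flag.
 */
function replace_flagged_bytes(string $input, array $flagged): string
{
    $out = '';
    $length = strlen($input);
    for ($i = 0; $i < $length; $i++) {
        $byte = ord($input[$i]);
        $out .= in_array($byte, $flagged, true) ? '_' : $input[$i];
    }
    return $out;
}

// Trailing bytes of 'ם' (d7 9d) and 'א' (d7 90), as given in the report.
$flagged = [0x9D, 0x90];

echo bin2hex(replace_flagged_bytes("\xD7\x9D", $flagged)), "\n"; // d75f
echo bin2hex(replace_flagged_bytes("\xD7\x90", $flagged)), "\n"; // d75f
echo bin2hex(replace_flagged_bytes("\xD7\x9E", $flagged)), "\n"; // d79e
echo bin2hex(replace_flagged_bytes("\xD7\x9D", [])), "\n";       // d79d
```

With an empty flagged set, which is what a classifier that treats octets 80-FF as opaque would amount to, the bytes pass through unchanged.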
 