Bug #48360 urlencode and rawurlencode are not RFC-1738 compliant
Submitted: 2009-05-22 10:17 UTC Modified: 2009-06-02 08:26 UTC
From: martin2007 at laposte dot net Assigned:
Status: Not a bug Package: URL related
PHP Version: 5.2.9 OS: Linux
Private report: No CVE-ID: None

 [2009-05-22 10:17 UTC] martin2007 at laposte dot net
Description:
------------
urlencode and rawurlencode are not RFC-1738 compliant.

RFC-1738 states:
" Thus, only alphanumerics, the special characters "$-_.+!*'(),", and reserved characters used for their reserved purposes may be used unencoded within a URL."
Later on, the grammar is as follows:

unreserved     = alpha | digit | safe | extra
safe           = "$" | "-" | "_" | "." | "+"
extra          = "!" | "*" | "'" | "(" | ")" | ","


However, urlencode and rawurlencode encode every one of the characters "$!*'(),".

Note that, except for "$" and "," (which RFC-2396 lists as reserved), these characters are also unreserved in RFC-2396 (URI).

The main problem is that Google uses another encoding scheme. When you have URLs containing these characters, your weblogs contain several different URLs for the same resource. It might also confuse some web server implementations.


See: http://www.monperrus.net/martin/googenc/


Reproduce code:
---------------
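<?php
// "$" stays literal in this double-quoted string: it is not followed by a valid variable name.
echo urlencode("$!*'(),"), "\n";
echo rawurlencode("$!*'(),"), "\n";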

Expected result:
----------------
$!*'(),
$!*'(),

Actual result:
--------------
%24%21%2A%27%28%29%2C
%24%21%2A%27%28%29%2C
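
For anyone who needs the output shown under "Expected result", here is a minimal workaround sketch. The helper name rfc1738_lenient_encode is hypothetical (it is not part of PHP); it runs rawurlencode() and then restores the seven characters discussed above. "+" is deliberately left encoded in this sketch, since urldecode() would turn a literal "+" into a space.

```php
<?php
// Hypothetical helper, not a PHP built-in: percent-encode a string with
// rawurlencode(), then put back the characters $ ! * ' ( ) , unencoded.
function rfc1738_lenient_encode($s)
{
    $encoded = rawurlencode($s);

    // Map each escape triple back to its literal character.
    // A literal "%" in the input becomes "%25", so an input such as
    // "%24" turns into "%2524" and is not mistaken for an encoded "$".
    $keep = array(
        '%24' => '$',
        '%21' => '!',
        '%2A' => '*',
        '%27' => "'",
        '%28' => '(',
        '%29' => ')',
        '%2C' => ',',
    );

    return strtr($encoded, $keep);
}

echo rfc1738_lenient_encode("$!*'(),"), "\n"; // prints $!*'(),
echo rfc1738_lenient_encode("a b/c"), "\n";   // prints a%20b%2Fc
```

This only changes how the URL is spelled; as long as the characters are not being used for a reserved purpose, both spellings should identify the same resource.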

 [2009-05-22 11:47 UTC] lbarnaud@php.net
Sorry, but your problem does not imply a bug in PHP itself.  For a
list of more appropriate places to ask for help using PHP, please
visit http://www.php.net/support.php as this bug system is not the
appropriate forum for asking support questions.  Due to the volume
of reports we can not explain in detail here why your report is not
a bug.  The support channels will be able to provide an explanation
for you.

Thank you for your interest in PHP.

From the RFC:
   Usually a URL has the same interpretation when an octet is
   represented by a character and when it encoded. [...]

   [...] characters that are not required to be encoded
   (including alphanumerics) may be encoded within the scheme-specific
   part of a URL, as long as they are not being used for a reserved
   purpose.


This means urlencode() may encode everything, including alphanumerics, and still be RFC1738 compliant.

www.example.com/$!*'(), === www.example.com/%24%21%2A%27%28%29%2C
www.example.com/%24%21%2A%27%28%29%2C === www.example.com/$!*'(),
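
A quick way to check this equivalence from PHP is to decode both spellings and compare the results. This is only an illustrative sketch; the variable names are made up for the example.

```php
<?php
// Two spellings of the same path: literal characters vs. percent-encoded.
$plain   = "/$!*'(),";
$encoded = "/%24%21%2A%27%28%29%2C";

// rawurldecode() leaves the literal characters alone and decodes the
// escape triples, so both inputs yield the same octets.
var_dump(rawurldecode($plain) === rawurldecode($encoded)); // bool(true)
```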

For your experiment, you may want to try linking to the same page twice, encoded differently each time. Then check whether Google indexes the page twice under two different URLs.

Search engines are smart enough to canonicalize every URL they have to work with. Two URLs encoded differently are still the same.
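
If differently encoded URLs are cluttering your own logs, the same kind of canonicalization can be approximated on your side. The sketch below is an assumption about how one might do it, not a description of what any search engine does; canonical_path is a hypothetical helper name. It decodes and re-encodes each path segment so that every spelling collapses to one form.

```php
<?php
// Hypothetical helper: normalise a URL path so that differently encoded
// spellings compare equal. The path is split on "/" before decoding, so a
// real separator stays distinct from an encoded "%2F" inside a segment.
function canonical_path($path)
{
    $segments = explode('/', $path);

    $normalised = array_map(
        function ($segment) {
            // Decode first, then re-encode, so each segment ends up in
            // the single spelling that rawurlencode() produces.
            return rawurlencode(rawurldecode($segment));
        },
        $segments
    );

    return implode('/', $normalised);
}

$a = canonical_path("/a/$!*'(),");
$b = canonical_path("/a/%24%21%2A%27%28%29%2C");

var_dump($a === $b); // bool(true)
echo $a, "\n";       // prints /a/%24%21%2A%27%28%29%2C
```

This covers the path only; query strings have their own delimiters ("&", "=") and would need to be split on those before the same treatment is applied.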
 
PHP Copyright © 2001-2024 The PHP Group
All rights reserved.
Last updated: Fri Mar 29 04:01:29 2024 UTC