Request #31649 urldecode should support %uHHHH Unicode codepoint notation, which is standard
Submitted: 2005-01-21 22:29 UTC Modified: 2022-04-08 08:19 UTC
Votes:2
Avg. Score:4.5 ± 0.5
Reproduced:1 of 2 (50.0%)
Same Version:0 (0.0%)
Same OS:1 (100.0%)
From: james at gogo dot co dot nz Assigned: ilutov (profile)
Status: Closed Package: URL related
PHP Version: * OS: All
Private report: No CVE-ID: None
 [2005-01-21 22:29 UTC] james at gogo dot co dot nz
Description:
------------
urldecode() does not understand the %uxxxx format for escaping Unicode characters above 0xFF.

This is a very old bug, originally reported as bug #15027 and declared bogus. I believe that was a mistake, and here is the reasoning.

In all modern browsers (including Mozilla), JavaScript's escape() function uses %HH for Unicode codepoints below 0x0100, but %uHHHH for codepoints at or above 0x0100.

From ECMA-262:
--------------
For characters whose Unicode encoding is 0xFF or less, a two-digit escape sequence of the form %xx is used in accordance with RFC1738. For characters whose Unicode encoding is greater than 0xFF, a four-digit escape sequence of the form %uxxxx is used.
--------------
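
For illustration, the rule quoted above can be sketched in PHP. This is only a rough approximation of what escape() produces, not code from any browser or from PHP itself: js_escape_sketch() is a hypothetical name, the mbstring extension is assumed to be available, and the input is assumed to be valid UTF-8. The set of characters left unescaped is the one ECMA-262 lists for escape().

```php
<?php
// Hypothetical helper: approximates JavaScript's escape() for a UTF-8 string.
function js_escape_sketch(string $utf8): string
{
    if ($utf8 === '') {
        return '';
    }
    // Characters that escape() leaves untouched.
    $unescaped = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789@*_+-./';

    // escape() works on 16-bit code units, so go through UTF-16BE.
    $units = unpack('n*', mb_convert_encoding($utf8, 'UTF-16BE', 'UTF-8'));

    $out = '';
    foreach ($units as $unit) {
        if ($unit < 0x80 && strpos($unescaped, chr($unit)) !== false) {
            $out .= chr($unit);                  // left as-is
        } elseif ($unit <= 0xFF) {
            $out .= sprintf('%%%02X', $unit);    // two-digit form: %xx
        } else {
            $out .= sprintf('%%u%04X', $unit);   // four-digit form: %uxxxx
        }
    }
    return $out;
}

echo js_escape_sketch("a b"), "\n";          // a%20b
echo js_escape_sketch("\u{2013}"), "\n";     // %u2013
```

Note that under this rule %E9 stands for the codepoint U+00E9, not for a raw byte of a UTF-8 sequence, which is a further difference from how urldecode() reads %HH.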

I believe this is a bug: PHP is unable to urldecode the valid escape()d values from modern browsers when those escape()d strings contain Unicode characters greater than 0xFF.

Declaring it not a bug because it is not in the RFCs, but rather defined by ECMA is a poor decision.



Reproduce code:
---------------
echo urldecode('%u2013');


Expected result:
----------------
A string containing the three bytes that encode the Unicode character U+2013 (En Dash) in UTF-8, namely 0xE2, 0x80 and 0x93.

Actual result:
--------------
The literal string "%u2013".
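
Until or unless urldecode() handles this notation, a userland workaround is possible. The following is a minimal sketch, not an official API: urldecode_with_u() is a hypothetical name, the mbstring extension is assumed to be available, and the choice to emit UTF-8 is an assumption that matches the expected result above. Runs of %uHHHH are treated as UTF-16 code units so that surrogate pairs are joined; everything else goes to the built-in urldecode() unchanged.

```php
<?php
// Hypothetical helper: urldecode() plus support for %uHHHH sequences.
function urldecode_with_u(string $input): string
{
    // Split on runs of %uHHHH; with DELIM_CAPTURE the runs land at odd indexes.
    $pieces = preg_split(
        '/((?:%u[0-9A-Fa-f]{4})+)/',
        $input,
        -1,
        PREG_SPLIT_DELIM_CAPTURE
    );

    $out = '';
    foreach ($pieces as $i => $piece) {
        if ($i % 2 === 1) {
            // "%uD83D%uDE00" -> "D83DDE00" -> raw UTF-16BE bytes -> UTF-8.
            $utf16 = hex2bin(str_replace('%u', '', $piece));
            $out .= mb_convert_encoding($utf16, 'UTF-8', 'UTF-16BE');
        } else {
            // Ordinary %HH and '+' handling stays with the built-in function.
            $out .= urldecode($piece);
        }
    }
    return $out;
}

echo bin2hex(urldecode_with_u('%u2013')), "\n"; // e28093
```

Decoding each piece separately means a decoded '%' or '+' coming out of a %uHHHH run is never fed back into urldecode().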

History

 [2005-01-21 22:46 UTC] derick@php.net
PHP doesn't support Unicode in a whole lot of places. Marking this as a feature request instead.
 [2015-01-08 23:26 UTC] ajf@php.net
-Package: Feature/Change Request
+Package: URL related
-PHP Version: 4.3.10
+PHP Version: *
 [2015-01-08 23:26 UTC] ajf@php.net
This should probably decode to UTF-8, if it decodes to anything.
 [2015-01-08 23:26 UTC] ajf@php.net
-Summary: urldecode does not follow ECMA standard or standard browser practice
+Summary: urldecode should support %uHHHH Unicode codepoint notation, which is standard
 [2018-03-26 21:59 UTC] cmb@php.net
The %uxxxx encoding is non-standard, and the escape() function
is contained in an annex of ECMA-262 (Edition 6.0) only, which states[1]:

| All of the language features and behaviours specified in this
| annex have one or more undesirable characteristics and in the
| absence of legacy usage would be removed from this specification.

In my opinion, it does not make sense to support %uxxxx encoding in
urldecode().

[1] <https://www.ecma-international.org/ecma-262/6.0/#sec-additional-ecmascript-features-for-web-browsers>
 [2022-04-08 08:19 UTC] ilutov@php.net
-Status: Open
+Status: Closed
-Assigned To:
+Assigned To: ilutov
 [2022-04-08 08:19 UTC] ilutov@php.net
As per the comment from @cmb, I'm closing this issue.
 
PHP Copyright © 2001-2024 The PHP Group
All rights reserved.
Last updated: Fri Apr 19 05:01:29 2024 UTC