Request #31649 urldecode should support %uHHHH Unicode codepoint notation, which is standard
Submitted: 2005-01-21 22:29 UTC Modified: 2022-04-08 08:19 UTC
Votes:2
Avg. Score:4.5 ± 0.5
Reproduced:1 of 2 (50.0%)
Same Version:0 (0.0%)
Same OS:1 (100.0%)
From: james at gogo dot co dot nz Assigned: ilutov (profile)
Status: Closed Package: URL related
PHP Version: * OS: All
Private report: No CVE-ID: None
 [2005-01-21 22:29 UTC] james at gogo dot co dot nz
Description:
------------
urldecode() does not understand the %uxxxx format for escaping Unicode characters above 0xFF.

This is a very old bug, originally reported as bug #15027 and declared bogus. I believe that decision was wrong, and my reasoning follows.

In all modern browsers (including Mozilla), JavaScript's escape() function uses %HH for Unicode codepoints below 0x0100, but %uHHHH for codepoints at or above 0x0100.

From ECMA-262:
--------------
For characters whose Unicode encoding is 0xFF or less, a two-digit escape sequence of the form %xx is used in accordance with RFC1738. For characters whose Unicode encoding is greater than 0xFF, a four-digit escape sequence of the form %uxxxx is used.
--------------

I believe this is a bug: PHP is unable to urldecode valid escape()d values from modern browsers when those strings contain Unicode characters greater than 0xFF.

Declaring it not a bug because the format is defined by ECMA rather than by the RFCs is a poor decision.



Reproduce code:
---------------
<?php
echo urldecode('%u2013');


Expected result:
----------------
A string containing the three bytes that encode the Unicode character U+2013 (En Dash) in UTF-8, namely 0xE2, 0x80 and 0x93.

Actual result:
--------------
The literal string "%u2013".
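
For readers who need this behaviour today, the expected result can be approximated in userland. The following is only a sketch under stated assumptions: `urldecode_u()` is a hypothetical helper name (not part of PHP), the mbstring extension is assumed to be loaded, and %HH sequences are treated as raw bytes exactly as urldecode() treats them (escape() itself uses %HH for Latin-1 codepoints, so values such as %E9 would need separate handling). Runs of %uHHHH are read as UTF-16 code units, so surrogate pairs are combined.

```php
<?php
// Hypothetical helper, not part of PHP: urldecode() plus %uHHHH support.
// Assumes the mbstring extension is available.
function urldecode_u(string $input): string
{
    // One pass over the input, so decoded output is never decoded twice.
    $pattern = '/((?:%u[0-9a-fA-F]{4})+)|%([0-9a-fA-F]{2})|\+/';

    $result = preg_replace_callback($pattern, static function (array $m): string {
        if (($m[1] ?? '') !== '') {
            // A run of %uHHHH escapes is a sequence of UTF-16 code units.
            $units = array_map('hexdec', explode('%u', substr($m[1], 2)));
            $utf16 = pack('n*', ...$units);
            return mb_convert_encoding($utf16, 'UTF-8', 'UTF-16BE');
        }
        if (($m[2] ?? '') !== '') {
            // Plain %HH: emit the raw byte, as urldecode() does.
            return chr(hexdec($m[2]));
        }
        return ' '; // '+' decodes to a space, as in urldecode()
    }, $input);

    // preg_replace_callback() returns null on a PCRE error; fall back to the input.
    return $result ?? $input;
}

echo bin2hex(urldecode_u('%u2013')), "\n"; // e28093
```

The single regular-expression pass is a deliberate choice: decoding %uHHHH first and then calling urldecode() on the result would decode input such as %u002541 twice.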

History
 [2005-01-21 22:46 UTC] derick@php.net
PHP doesn't support Unicode in a whole lot of places. Marking this as a feature request instead.
 [2015-01-08 23:26 UTC] ajf@php.net
-Package: Feature/Change Request
+Package: URL related
-PHP Version: 4.3.10
+PHP Version: *
 [2015-01-08 23:26 UTC] ajf@php.net
This should probably decode to UTF-8, if it decodes to anything.
 [2015-01-08 23:26 UTC] ajf@php.net
-Summary: urldecode does not follow ECMA standard or standard browser practice
+Summary: urldecode should support %uHHHH Unicode codepoint notation, which is standard
 [2018-03-26 21:59 UTC] cmb@php.net
The %uxxxx encoding is non-standard, and the escape() function
is contained in an annex of ECMA-262 (Edition 6.0) only, which states[1]:

| All of the language features and behaviours specified in this
| annex have one or more undesirable characteristics and in the
| absence of legacy usage would be removed from this specification.

In my opinion, it does not make sense to support %uxxxx encoding in
urldecode().

[1] <https://www.ecma-international.org/ecma-262/6.0/#sec-additional-ecmascript-features-for-web-browsers>
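
For contrast, the standard form that urldecode() does handle is percent-encoded UTF-8 bytes, which is what rawurlencode() produces in PHP (and what JavaScript's encodeURIComponent() produces on the browser side). A minimal sketch, assuming PHP 7 or later for the `"\u{...}"` string syntax; the variable names are illustrative only:

```php
<?php
// U+2013 EN DASH as a UTF-8 string (the \u{...} escape requires PHP 7+).
$enDash = "\u{2013}";

// Standard percent-encoding works on the UTF-8 bytes, one %HH per byte.
$encoded = rawurlencode($enDash);
echo $encoded, "\n"; // %E2%80%93

// urldecode() reverses this without any special handling.
var_dump(urldecode($encoded) === $enDash); // bool(true)
```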
 [2022-04-08 08:19 UTC] ilutov@php.net
-Status: Open
+Status: Closed
-Assigned To:
+Assigned To: ilutov
 [2022-04-08 08:19 UTC] ilutov@php.net
As per the comment from @cmb, I'm closing this issue.