Request #60412 UTF-8 functions doesn't respect unicode equivalence - Need Normalization
Submitted: 2011-11-29 22:17 UTC
Modified: 2015-01-08 22:11 UTC
Votes:1
Avg. Score:1.0 ± 0.0
Reproduced:1 of 1 (100.0%)
Same Version:0 (0.0%)
Same OS:0 (0.0%)
From: mike dot squire at gmail dot com
Assigned:
Status: Analyzed
Package: mbstring related
PHP Version: 5.4SVN-2011-11-04 (SVN)
OS: all
Private report: No
CVE-ID: None
 [2011-11-29 22:17 UTC] mike dot squire at gmail dot com
Description:
------------
Quote from http://en.wikipedia.org/wiki/Unicode_equivalence:

"...the code point U+006E (the Latin lowercase 'n') followed by U+0303 (the combining tilde '◌̃') is defined by Unicode to be canonically equivalent to the single code point U+00F1 (the lowercase letter 'ñ' of the Spanish alphabet). Therefore, those sequences should be displayed in the same manner, should be treated in the same way by applications such as alphabetizing names or searching, and may be substituted for each other."

This may be more a matter of documenting, for completeness, that the UTF-8 functions do not support Unicode equivalence.

Test script:
---------------
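<?php
echo "Output recorded from a terminal interpreting UTF-8\n\n";

// Decomposed form: U+006E (n) followed by U+0303 (combining tilde), 3 bytes in UTF-8.
var_dump("\x6e\xcc\x83");
// Precomposed form: U+00F1 converted from ISO-8859-1, 2 bytes in UTF-8.
var_dump(utf8_encode("\xf1"));

// Both comparisons can only be true if the decomposed sequence is composed before conversion.
var_dump(utf8_decode("\x6e\xcc\x83") == "\xf1");
var_dump(mb_convert_encoding("\x6e\xcc\x83", "ISO-8859-1", "UTF-8") == "\xf1");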


Expected result:
----------------
Output recorded from a terminal interpreting UTF-8

string(3) "ñ"
string(2) "ñ"
bool(true)
bool(true)

Actual result:
--------------
Output recorded from a terminal interpreting UTF-8

string(3) "ñ"
string(2) "ñ"
bool(false)
bool(false)
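
The two strings in the test script render identically but are different byte sequences, which is why the comparisons fail. A minimal sketch that makes the difference visible, assuming only core PHP functions (`utf8_chars_hex` is a hypothetical helper name, not part of the report):

```php
<?php
// Hypothetical helper: split a UTF-8 string into code points and show each one as hex bytes.
function utf8_chars_hex(string $utf8): array
{
    // The /u modifier makes PCRE treat the subject as UTF-8, so the empty
    // pattern splits between code points rather than between bytes.
    $chars = preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY);
    return array_map('bin2hex', $chars);
}

$decomposed  = "\x6e\xcc\x83"; // U+006E followed by U+0303
$precomposed = "\xc3\xb1";     // U+00F1

print_r(utf8_chars_hex($decomposed));   // two code points: 6e, cc83
print_r(utf8_chars_hex($precomposed));  // one code point: c3b1
var_dump($decomposed === $precomposed); // bool(false)
```

ISO-8859-1 has no combining tilde, so converting the decomposed form cannot produce the single byte `\xf1`; the combining mark is typically replaced by a substitute character such as `?`.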

History

 [2011-11-29 23:58 UTC] yohgaki@php.net
-Summary: UTF-8 functions doesn't respect unicode equivalence
+Summary: UTF-8 functions doesn't respect unicode equivalence - Need Normalization
-Status: Open
+Status: Analyzed
-Package: Unicode Engine related
+Package: mbstring related
-Operating System: OSX (though probably all)
+Operating System: all
-PHP Version: 5.3.8
+PHP Version: 5.4SVN-2011-11-04 (SVN)
 [2011-11-29 23:58 UTC] yohgaki@php.net
What you are looking for is normalization. The intl extension provides it, but mbstring does not.

I changed the bug type to feature request.
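
As a rough illustration of the normalization mentioned above, the following sketch composes the decomposed sequence to NFC with the intl extension's `Normalizer` class before converting it. It assumes both intl and mbstring are loaded, and it is only one possible workaround, not a change to mbstring itself:

```php
<?php
// Assumption: the intl extension is loaded (Normalizer is part of intl, not mbstring).
$decomposed = "\x6e\xcc\x83"; // U+006E followed by U+0303

// Compose to NFC first, so "n" + combining tilde becomes the single code point U+00F1.
$nfc = Normalizer::normalize($decomposed, Normalizer::FORM_C);

var_dump(bin2hex($nfc)); // string(4) "c3b1"

// After normalization the conversion to ISO-8859-1 yields the expected single byte.
var_dump(mb_convert_encoding($nfc, "ISO-8859-1", "UTF-8") === "\xf1"); // bool(true)
```

Normalizing input once, at the point where it enters the application, keeps later comparisons and conversions consistent without having to normalize at every call site.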
 [2015-01-08 22:11 UTC] ajf@php.net
-Type: Bug
+Type: Feature/Change Request
 [2015-01-08 22:11 UTC] ajf@php.net
Yasuo's change apparently didn't take effect. This change to Feature Request should stick?
 