Request #60412 UTF-8 functions doesn't respect unicode equivalence - Need Normalization
Submitted: 2011-11-29 22:17 UTC
Modified: 2015-01-08 22:11 UTC
Votes:1
Avg. Score:1.0 ± 0.0
Reproduced:1 of 1 (100.0%)
Same Version:0 (0.0%)
Same OS:0 (0.0%)
From: mike dot squire at gmail dot com
Assigned:
Status: Analyzed
Package: mbstring related
PHP Version: 5.4SVN-2011-11-04 (SVN)
OS: all
Private report: No
CVE-ID: None
 [2011-11-29 22:17 UTC] mike dot squire at gmail dot com
Description:
------------
Quote from http://en.wikipedia.org/wiki/Unicode_equivalence:

"...the code point U+006E (the Latin lowercase 'n') followed by U+0303 (the combining tilde '◌̃') is defined by Unicode to be canonically equivalent to the single code point U+00F1 (the lowercase letter 'ñ' of the Spanish alphabet). Therefore, those sequences should be displayed in the same manner, should be treated in the same way by applications such as alphabetizing names or searching, and may be substituted for each other."

This may be more a matter of documenting, for completeness, that these functions do not support Unicode equivalence.

Test script:
---------------
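<?php
echo "Output recorded from a terminal interpreting UTF-8\n\n";

// Decomposed form: U+006E followed by U+0303 (3 bytes in UTF-8)
var_dump("\x6e\xcc\x83");
// Precomposed form: U+00F1, encoded from its ISO-8859-1 byte (2 bytes in UTF-8)
var_dump(utf8_encode("\xf1"));

// Both conversions to ISO-8859-1 are expected to yield the single byte 0xF1
var_dump(utf8_decode("\x6e\xcc\x83") == "\xf1");
var_dump(mb_convert_encoding("\x6e\xcc\x83", "ISO-8859-1", "UTF-8") == "\xf1");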


Expected result:
----------------
Output recorded from a terminal interpreting UTF-8

string(3) "ñ"
string(2) "ñ"
bool(true)
bool(true)

Actual result:
--------------
Output recorded from a terminal interpreting UTF-8

string(3) "ñ"
string(2) "ñ"
bool(false)
bool(false)

History
 [2011-11-29 23:58 UTC] yohgaki@php.net
-Summary: UTF-8 functions doesn't respect unicode equivalence
+Summary: UTF-8 functions doesn't respect unicode equivalence - Need Normalization
-Status: Open
+Status: Analyzed
-Package: Unicode Engine related
+Package: mbstring related
-Operating System: OSX (though probably all)
+Operating System: all
-PHP Version: 5.3.8
+PHP Version: 5.4SVN-2011-11-04 (SVN)
 [2011-11-29 23:58 UTC] yohgaki@php.net
What you are looking for is normalization. The intl module has it, but mbstring does not.

I changed bug type to feature request.
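
For reference, a minimal sketch of the normalization approach mentioned above, assuming both the intl and mbstring extensions are loaded. The helper name latin1_from_utf8_nfc is hypothetical and not part of PHP. It composes the input to NFC with intl's Normalizer before converting, so the decomposed sequence from the test script should end up as the single byte 0xF1.

```php
<?php
// Hypothetical helper (not part of PHP): compose the input to NFC first,
// then convert it to ISO-8859-1.
// Assumes the intl and mbstring extensions are both loaded.
function latin1_from_utf8_nfc(string $utf8): string
{
    // NFC composes U+006E followed by U+0303 into the single code point U+00F1.
    $nfc = Normalizer::normalize($utf8, Normalizer::FORM_C);
    if ($nfc === false) {
        throw new InvalidArgumentException("Input could not be normalized");
    }
    return mb_convert_encoding($nfc, "ISO-8859-1", "UTF-8");
}

// Decomposed input from the test script
var_dump(latin1_from_utf8_nfc("\x6e\xcc\x83") === "\xf1");
// Precomposed input (U+00F1 as UTF-8)
var_dump(latin1_from_utf8_nfc("\xc3\xb1") === "\xf1");
```

Normalizing both sides of a comparison to the same form (NFC here) before comparing or converting is the usual way to get the equivalence the report asks for. mbstring itself does not do this step.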
 [2015-01-08 22:11 UTC] ajf@php.net
-Type: Bug
+Type: Feature/Change Request
 [2015-01-08 22:11 UTC] ajf@php.net
Yasuo's change apparently didn't take effect. This change to Feature Request should stick?