Request #60412 UTF-8 functions doesn't respect unicode equivalence - Need Normalization
Submitted: 2011-11-29 22:17 UTC Modified: 2015-01-08 22:11 UTC
Votes:1
Avg. Score:1.0 ± 0.0
Reproduced:1 of 1 (100.0%)
Same Version:0 (0.0%)
Same OS:0 (0.0%)
From: mike dot squire at gmail dot com
Assigned:
Status: Analyzed
Package: mbstring related
PHP Version: 5.4SVN-2011-11-04 (SVN)
OS: all
Private report: No
CVE-ID: None

 
 [2011-11-29 22:17 UTC] mike dot squire at gmail dot com
Description:
------------
Quote from http://en.wikipedia.org/wiki/Unicode_equivalence:

"...the code point U+006E (the Latin lowercase 'n') followed by U+0303 (the combining tilde '◌̃') is defined by Unicode to be canonically equivalent to the single code point U+00F1 (the lowercase letter 'ñ' of the Spanish alphabet). Therefore, those sequences should be displayed in the same manner, should be treated in the same way by applications such as alphabetizing names or searching, and may be substituted for each other."

This may be more a matter of documenting, for completeness, that the Unicode functions do not support Unicode equivalence.

Test script:
---------------
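<?php
echo "Output recorded from a terminal interpreting UTF-8\n\n";

// Decomposed form: U+006E (n) followed by U+0303 (combining tilde), 3 bytes in UTF-8.
var_dump("\x6e\xcc\x83");
// Precomposed form: the ISO-8859-1 byte 0xF1 converted to UTF-8 (U+00F1), 2 bytes.
var_dump(utf8_encode("\xf1"));

// Convert the decomposed form to ISO-8859-1 and compare it with the single byte 0xF1.
var_dump(utf8_decode("\x6e\xcc\x83") == "\xf1");
var_dump(mb_convert_encoding("\x6e\xcc\x83", "ISO-8859-1", "UTF-8") == "\xf1");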


Expected result:
----------------
Output recorded from a terminal interpreting UTF-8

string(3) "ñ"
string(2) "ñ"
bool(true)
bool(true)

Actual result:
--------------
Output recorded from a terminal interpreting UTF-8

string(3) "ñ"
string(2) "ñ"
bool(false)
bool(false)
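
For illustration only, a minimal sketch of why the two comparisons come out false: the two spellings of 'ñ' are different byte sequences, and the conversion functions work code point by code point without composing them first. The variable names are hypothetical, and `mb_convert_encoding()` is used in place of `utf8_encode()` on the assumption that mbstring is loaded.

```php
<?php
// Decomposed spelling: U+006E (n) + U+0303 (combining tilde).
$decomposed = "\x6e\xcc\x83";

// Precomposed spelling: U+00F1, obtained by converting the Latin-1 byte 0xF1 to UTF-8.
$precomposed = mb_convert_encoding("\xf1", "UTF-8", "ISO-8859-1");

// The two strings render alike but differ at the byte level.
echo bin2hex($decomposed), "\n";   // 6ecc83
echo bin2hex($precomposed), "\n";  // c3b1

// A plain comparison therefore sees two different strings.
var_dump($decomposed === $precomposed); // bool(false)
```

Because U+0303 has no counterpart in ISO-8859-1, converting the decomposed form cannot produce the single byte 0xF1 unless the string is composed beforehand.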

 [2011-11-29 23:58 UTC] yohgaki@php.net
-Summary: UTF-8 functions doesn't respect unicode equivalence
+Summary: UTF-8 functions doesn't respect unicode equivalence - Need Normalization
-Status: Open
+Status: Analyzed
-Package: Unicode Engine related
+Package: mbstring related
-Operating System: OSX (though probably all)
+Operating System: all
-PHP Version: 5.3.8
+PHP Version: 5.4SVN-2011-11-04 (SVN)
 [2011-11-29 23:58 UTC] yohgaki@php.net
What you are looking for is normalization. The intl module has it, but mbstring does not.

I changed bug type to feature request.
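
As a rough sketch of the normalization approach mentioned above, and assuming the intl extension is loaded, the decomposed string from the test script could be composed to NFC with `Normalizer::normalize()` before conversion. The variable names are hypothetical; this is one possible workaround, not a change to mbstring itself.

```php
<?php
// Assumption: the intl extension is available (it provides the Normalizer class).
$decomposed = "\x6e\xcc\x83"; // U+006E + U+0303

// Compose to Normalization Form C, which yields the single code point U+00F1.
$composed = Normalizer::normalize($decomposed, Normalizer::FORM_C);

var_dump($composed === "\xc3\xb1"); // bool(true): UTF-8 encoding of U+00F1

// After normalization, the conversion from the test script matches the Latin-1 byte.
var_dump(mb_convert_encoding($composed, "ISO-8859-1", "UTF-8") === "\xf1"); // bool(true)
```

`Normalizer::normalize()` returns false on failure, so production code would want to check the result before passing it on.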
 [2015-01-08 22:11 UTC] ajf@php.net
-Type: Bug
+Type: Feature/Change Request
 [2015-01-08 22:11 UTC] ajf@php.net
Yasuo's change apparently didn't take effect. This change to Feature Request should stick?
 
Last updated: Thu Apr 25 20:01:45 2024 UTC