Bug #70609 mbstring: simple casemaps instead of Unicode's recommendation: full casemaps
Submitted: 2015-09-30 13:59 UTC Modified: 2017-07-28 10:44 UTC
Votes:1
Avg. Score:3.0 ± 0.0
Reproduced:1 of 1 (100.0%)
Same Version:0 (0.0%)
Same OS:0 (0.0%)
From: cl at exomail dot to Assigned: nikic (profile)
Status: Closed Package: mbstring related
PHP Version: Irrelevant OS: all
Private report: No CVE-ID: None
 [2015-09-30 13:59 UTC] cl at exomail dot to
Description:
------------
mbstring's mb_strtoupper (and the related case-conversion functions) performs "simple case mapping" instead of the full case mapping that Unicode recommends.

http://www.unicode.org/versions/Unicode8.0.0/

Section 4.2 "Case", Headline "Case Mapping"
"The single-character mappings in UnicodeData.txt are insufficient for languages such as German. Therefore, only legacy implementations that cannot handle case mappings that increase string lengths should use UnicodeData.txt case mappings alone."

PHP should stop being "legacy" and "insufficient".

For implementation guidelines, see section 5.18 "Case Mappings" and
ftp://ftp.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt


PHP's current implementation: 
php-src/ext/mbstring/ucgendat/ucgendat.c
seems to ignore "SpecialCasing.txt". 
The data structure
 static const unsigned int _uccase_map[]
in php-src/ext/mbstring/unicode_data.h assumes (IMHO) a one-to-one mapping.
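
To illustrate why a one-to-one table is not enough, here is a minimal sketch of a one-to-many upper-case map built from lines in the SpecialCasing.txt layout. The function names parse_special_upper and full_upper_sketch are hypothetical and are not part of mbstring; the three sample lines are copied in the file's format, and the sketch assumes PHP 7.2 or later (for mb_ord and mb_chr). Conditional entries such as Final_Sigma are simply skipped.

```php
<?php
// Sample lines in the SpecialCasing.txt layout:
//   code; lower; title; upper; (condition_list;)? # comment
$sample = <<<'TXT'
00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
0149; 0149; 02BC 004E; 02BC 004E; # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
03A3; 03C2; 03A3; 03A3; Final_Sigma; # GREEK CAPITAL LETTER SIGMA
TXT;

// Hypothetical helper: build "code point => list of upper-case code points".
function parse_special_upper(string $text): array
{
    $map = [];
    foreach (preg_split('/\R/', $text) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line)); // drop trailing comment
        if ($line === '') {
            continue;
        }
        $fields = array_map('trim', explode(';', $line));
        // Skip malformed lines and conditional entries (non-empty 5th field).
        if (count($fields) < 4 || ($fields[4] ?? '') !== '') {
            continue;
        }
        $map[hexdec($fields[0])] = array_map('hexdec', preg_split('/\s+/', $fields[3]));
    }
    return $map;
}

// Hypothetical helper: apply the one-to-many map first, and fall back to
// mb_strtoupper for every character that has no special entry.
function full_upper_sketch(string $s, array $map): string
{
    $out = '';
    foreach (preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY) as $ch) {
        $cp = mb_ord($ch, 'UTF-8');
        if (isset($map[$cp])) {
            foreach ($map[$cp] as $upper) {
                $out .= mb_chr($upper, 'UTF-8');
            }
        } else {
            $out .= mb_strtoupper($ch, 'UTF-8');
        }
    }
    return $out;
}

$map = parse_special_upper($sample);
echo full_upper_sketch("\xc3\x9f", $map), "\n"; // SS
echo full_upper_sketch("\xc5\x89", $map), "\n"; // ʼN (U+02BC followed by N)
```

The point of the sketch is only that each map value is a list of code points, so the result can be longer than the input; a table of single unsigned int targets cannot express that.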

BTW:
Perl, Python, Java and others all do full case mapping.

This is *not* a bleeding-edge feature. Perl has had this since version 5.8.0 (released in 2002!).


Test script:
---------------
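<?php
$str = "\xc3\x9f"; echo $str . " upper => " . mb_strtoupper($str, 'UTF-8') . "\n"; // U+00DF LATIN SMALL LETTER SHARP S
$str = "\xc5\x89"; echo $str . " upper => " . mb_strtoupper($str, 'UTF-8') . "\n"; // U+0149 LATIN SMALL LETTER N PRECEDED BY APOSTROPHE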


Expected result:
----------------
ß upper => SS
ŉ upper => ʼN


Actual result:
--------------
ß upper => ß
ŉ upper => ŉ


 [2017-07-28 10:44 UTC] nikic@php.net
I've implemented full case mapping in master, with the exception that the Final_Sigma condition is not handled.

Relevant commit: https://github.com/php/php-src/commit/582a65b06f3de125887cab02d5c561168fcf94bc
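
For readers wondering what the unhandled Final_Sigma condition means in practice: when lower-casing, a capital sigma at the end of a word should become ς (U+03C2) rather than σ (U+03C3). The following is a rough sketch only, not the code from the commit. The function name lower_with_final_sigma_sketch is hypothetical, and the rule is deliberately simplified to "preceded by a letter and not followed by a letter" (\p{L}), whereas the Unicode definition is phrased in terms of cased and case-ignorable characters.

```php
<?php
// Hypothetical helper: approximate the Final_Sigma rule before lower-casing.
function lower_with_final_sigma_sketch(string $s): string
{
    // Simplified condition: a capital sigma (U+03A3) that follows a letter
    // and is not itself followed by a letter is rewritten to final sigma.
    $s = preg_replace('/(?<=\p{L})\x{03A3}(?!\p{L})/u', "\u{03C2}", $s) ?? $s;

    // Everything else goes through the normal lower-case mapping.
    return mb_strtolower($s, 'UTF-8');
}

echo lower_with_final_sigma_sketch("ΟΔΟΣ"), "\n";  // οδος
echo lower_with_final_sigma_sketch("ΣΟΦΟΣ"), "\n"; // σοφος
echo lower_with_final_sigma_sketch("Σ"), "\n";     // σ
```

Because the check looks at neighbouring characters, it cannot be expressed as a per-character table lookup, which is presumably why it is treated separately from the unconditional mappings.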
 [2017-07-28 10:44 UTC] nikic@php.net
-Status: Open
+Status: Closed
-Assigned To:
+Assigned To: nikic
 