Bug #70609 mbstring: simple case mappings used instead of Unicode's recommended full case mappings
Submitted: 2015-09-30 13:59 UTC
Modified: 2017-07-28 10:44 UTC
Votes:1
Avg. Score:3.0 ± 0.0
Reproduced:1 of 1 (100.0%)
Same Version:0 (0.0%)
Same OS:0 (0.0%)
From: cl at exomail dot to
Assigned: nikic
Status: Closed
Package: mbstring related
PHP Version: Irrelevant
OS: all
Private report: No
CVE-ID: None
 [2015-09-30 13:59 UTC] cl at exomail dot to
Description:
------------
mbstring's mb_strtoupper (and the related case-conversion functions) perform "simple case mapping" instead of the recommended full case mapping.

http://www.unicode.org/versions/Unicode8.0.0/

Section 4.2 "Case", Headline "Case Mapping"
"The single-character mappings in UnicodeData.txt are insufficient for languages such as German. Therefore, only legacy implementations that cannot handle case mappings that increase string lengths should use UnicodeData.txt case mappings alone."

PHP should stop being "legacy" and "insufficient".

For implementation guidelines, see section 5.18 "Case Mappings" and
ftp://ftp.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt


PHP's current implementation: 
php-src/ext/mbstring/ucgendat/ucgendat.c
seems to ignore "SpecialCasing.txt". 
The data structure
 static const unsigned int _uccase_map[]
in php-src/ext/mbstring/unicode_data.h assumes (IMHO) a one-to-one mapping.
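
To illustrate why a one-to-one table cannot express these mappings, here is a minimal Python sketch. It is not PHP's actual data structure: the names `SIMPLE_UPPER`, `SPECIAL_UPPER`, `upper_simple` and `upper_full` are hypothetical, and the tables hold only a few sample entries (the two multi-character ones are the uppercase mappings SpecialCasing.txt lists for U+00DF and U+0149).

```python
# Hypothetical miniature tables; real ones would be generated from
# UnicodeData.txt (simple mappings) and SpecialCasing.txt (full mappings).
SIMPLE_UPPER = {0x00E4: 0x00C4}           # ä -> Ä, one code point to one
SPECIAL_UPPER = {
    0x00DF: (0x0053, 0x0053),             # ß -> "SS"
    0x0149: (0x02BC, 0x004E),             # ŉ -> "ʼN"
}

def upper_simple(s):
    """One-to-one mapping: the output has as many code points as the input."""
    return "".join(chr(SIMPLE_UPPER.get(ord(c), ord(c))) for c in s)

def upper_full(s):
    """One-to-many mapping: the special table is consulted first."""
    out = []
    for c in s:
        cp = ord(c)
        if cp in SPECIAL_UPPER:
            out.extend(chr(x) for x in SPECIAL_UPPER[cp])
        else:
            out.append(chr(SIMPLE_UPPER.get(cp, cp)))
    return "".join(out)

print(upper_simple("\u00e4\u00df"))  # ß is left unchanged
print(upper_full("\u00e4\u00df"))    # ß expands to two characters
```

The point of the sketch is only that the full mapping needs a value type that can hold more than one code point, so the result can be longer than the input.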

BTW:
Perl, Python, Java and others all do full case mapping.

This is *not* a bleeding-edge feature. Perl has had this since version 5.8.0 (released in 2002!).
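
For comparison, a short check of how Python behaves on the two strings used in this report. This assumes Python 3.3 or later, where `str.upper()` applies full case mapping; the variable name `samples` is only for illustration.

```python
# U+00DF (ß) and U+0149 (ŉ), the same characters as in the PHP test script.
samples = ["\u00df", "\u0149"]

for s in samples:
    # str.upper() may return a string longer than its input.
    print(f"{s} upper => {s.upper()}")
```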


Test script:
---------------
$str="\xc3\x9f"; echo $str." upper => ".mb_strtoupper($str,'UTF-8')."\n";
$str="\xc5\x89"; echo $str." upper => ".mb_strtoupper($str,'UTF-8')."\n";


Expected result:
----------------
ß upper => SS
ʼn upper => ʼN


Actual result:
--------------
ß upper => ß
ʼn upper => ʼn


 [2017-07-28 10:44 UTC] nikic@php.net
I've implemented full case mapping in master, with the exception that the Final_Sigma condition is not handled.

Relevant commit: https://github.com/php/php-src/commit/582a65b06f3de125887cab02d5c561168fcf94bc
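
For readers unfamiliar with the Final_Sigma condition: it is the context rule in SpecialCasing.txt under which a capital sigma (U+03A3) at the end of a word lowercases to ς (U+03C2) rather than σ (U+03C3). The following Python sketch only illustrates the rule and does not show PHP's code. It assumes Python 3.3 or later, whose `str.lower()` applies this condition, and the helper name `lower_per_char` is hypothetical.

```python
def lower_per_char(s):
    """Context-free lowercasing: each character is mapped on its own,
    so a capital sigma never sees its neighbours."""
    return "".join(c.lower() for c in s)

word = "\u039f\u0394\u039f\u03a3"   # ΟΔΟΣ

print(lower_per_char(word))  # ends in σ (U+03C3): no Final_Sigma handling
print(word.lower())          # ends in ς (U+03C2): Final_Sigma applied
```

A mapping without the Final_Sigma condition would presumably behave like `lower_per_char` here.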
 [2017-07-28 10:44 UTC] nikic@php.net
-Status: Open
+Status: Closed
-Assigned To:
+Assigned To: nikic
 