PHP :: Bug #70475 :: ext/mbstring/unicode

Bug #70475	ext/mbstring/unicode_data.h needs update
Submitted:	2015-09-11 13:43 UTC	Modified:	2017-07-23 21:26 UTC
From:	cl at exomail dot to	Assigned:	wez (profile)
Status:	Closed	Package:	mbstring related
PHP Version:	Irrelevant	OS:	all
Private report:	No	CVE-ID:	None

[2015-09-11 13:43 UTC] cl at exomail dot to

Description:
------------
Looking at GitHub, the last update of php-src/ext/mbstring/unicode_data.h was

2010-10-05 42dae97fd49f8d5f5d45c6254794f41fc2b32c88

So the Unicode-Data (for mb_strtoupper, etc.) is FIVE(!) years old.

There is a nice website called http://unicode.org/

I know PHP is playing in a different league than other programming languages (which try to follow Unicode changes as closely as possible). But think about looking at this nice Unicode webpage every 3 or 4 years and including the "newest" changes.

Even for really "old" characters mb_strtoupper does nothing, like for
U+00DF (ß) or U+0149 (ŉ).



Test script:
---------------
<?php
$str = "\xc3\x9f"; echo $str . " upper => " . mb_strtoupper($str, 'UTF-8') . "\n"; // U+00DF (ß)
$str = "\xc5\x89"; echo $str . " upper => " . mb_strtoupper($str, 'UTF-8') . "\n"; // U+0149 (ŉ)


Expected result:
----------------
expected output:
ß upper => SS
ŉ upper => ʼN


Actual result:
--------------
output of testscript:
ß upper => ß
ŉ upper => ŉ

[2015-09-13 06:59 UTC] laruence@php.net

-Assigned To: +Assigned To: wez

[2015-09-14 03:02 UTC] laruence@php.net

Hmm, I have tried upgrading unicode_data.h to UnicodeData-8.0.0, but the behavior still seems to be the same, so I am not sure whether I should do the update.

thanks

[2015-09-14 09:43 UTC] cl at exomail dot to

The problem has at least two aspects:
1. old data; this can be solved with an update
2. to know what full case folding means

ad 1:
You are not really asking whether PHP should continue to use the very old/outdated mappings, are you?

ad 2:
If you look at the function case_lookup in php_unicode.c, you see that the PHP developers assumed that one code point is always replaced by exactly one other code point.
Hence: you have a problem if you want to replace one code point with more than one code point (which is what has to happen for "FULL CASE FOLDING"). Take a look at
ftp://ftp.unicode.org/Public/UNIDATA/CaseFolding.txt
to see what Unicode's idea of full case folding is.

00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
      ^^ FULL case folding for U+00DF

In short:
case_lookup in php_unicode.c is "defective by design" for doing full case folding. There are a lot of code points (more than 100) that need full case folding, including the very common German "ß".

Even if you decide to give up on full case folding [I really really hope not; how long will PHP wait to *fully* support such simple things as strtoupper?], at least the update will help for simple case folding.
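
The limitation described above can be illustrated with a minimal C sketch. This is not the code from php_unicode.c: the struct names, the table contents and the functions simple_to_upper and full_to_upper are all hypothetical and exist only to contrast a one-to-one lookup with a lookup that may return several code points. The two multi-code-point entries are the uppercase mappings expected by the test script in this report.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical one-to-one table: one code point in, one code point out. */
struct simple_map { uint32_t from; uint32_t to; };

/* Hypothetical one-to-many table: one code point in, up to three out. */
struct full_map { uint32_t from; size_t len; uint32_t to[3]; };

static const struct simple_map simple_upper[] = {
    { 0x00E4, 0x00C4 },  /* ä -> Ä */
    { 0x00FC, 0x00DC },  /* ü -> Ü */
};

static const struct full_map full_upper[] = {
    { 0x00DF, 2, { 0x0053, 0x0053 } },  /* ß -> SS */
    { 0x0149, 2, { 0x02BC, 0x004E } },  /* ŉ -> ʼN */
};

/* One-to-one lookup: there is no way to express "SS" in the return value. */
static uint32_t simple_to_upper(uint32_t cp)
{
    size_t i;
    for (i = 0; i < sizeof simple_upper / sizeof simple_upper[0]; i++) {
        if (simple_upper[i].from == cp) {
            return simple_upper[i].to;
        }
    }
    return cp;  /* no entry: the code point is returned unchanged */
}

/* Full lookup: writes 1 to 3 code points into out and returns the count. */
static size_t full_to_upper(uint32_t cp, uint32_t out[3])
{
    size_t i, j;
    assert(out != NULL);
    for (i = 0; i < sizeof full_upper / sizeof full_upper[0]; i++) {
        if (full_upper[i].from == cp) {
            for (j = 0; j < full_upper[i].len; j++) {
                out[j] = full_upper[i].to[j];
            }
            return full_upper[i].len;
        }
    }
    out[0] = simple_to_upper(cp);  /* fall back to the one-to-one table */
    return 1;
}
```

With these tables, simple_to_upper(0x00DF) returns 0x00DF unchanged, which matches the "actual result" above, while full_to_upper(0x00DF, out) yields the two code points 0x0053 0x0053. The caller of the full variant has to cope with an output that is longer than the input.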

[2015-09-15 14:59 UTC] laruence@php.net

OK, unicode_data.h is updated to 8.0.0: https://github.com/php/php-src/commit/e841016df727896342310b579f93dfc55b931caf

[2015-09-29 20:03 UTC] fsb at thefsb dot org

Data should certainly be updated. But the test script is mistaken.

Case folding is *not* the same as case mapping, see 2nd FAQ here: http://unicode.org/faq/casemap_charprop.html

mbstring provides case mapping, which is what I would expect given the method names and documentation. The test scripts here, otoh, look like they are expecting case folding.

This might explain why laruence@php.net observed that updating the Unicode data made no difference. (Btw, from the same FAQ "Beginning with Unicode 5.0, case folding became subject to stability constraints.")

Unit tests that check whether mbstring is using 8.0 could focus instead on things mentioned in the Unicode 8.0 release announcement, e.g. code points becoming assigned. http://blog.unicode.org/2015/06/announcing-unicode-standard-version-80.html

If there's something wrong with mbstring's case conversion, it should be reported in a separate bug report.

[2015-09-29 22:49 UTC] cl at exomail dot to

The test script expects mbstring to do case *mapping* as defined in

Section 5.18 "Case Mappings" of "The Unicode Standard"
http://www.unicode.org/versions/Unicode8.0.0/ch05.pdf

There (in the *mapping* section) the "ß" is even given as example:
toUpperCase("ß") = "SS"
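
To make the folding/mapping distinction in the last two comments concrete, here is a small C sketch of what case *folding* is used for: caseless comparison. It is an illustration only and not mbstring code; fold_cp, fold_seq, caseless_equal and FOLD_CAP are hypothetical names, and the fold table covers just ASCII A-Z plus the full folding of U+00DF (0073 0073) quoted from CaseFolding.txt earlier in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FOLD_CAP 32  /* arbitrary buffer size for this sketch */

/* Fold one code point; writes 1 or 2 code points, returns the count.
 * Covers only ASCII A-Z and U+00DF (illustrative, not complete). */
static size_t fold_cp(uint32_t cp, uint32_t out[2])
{
    if (cp == 0x00DF) {              /* ß folds to "ss" (status F) */
        out[0] = 0x0073;
        out[1] = 0x0073;
        return 2;
    }
    if (cp >= 'A' && cp <= 'Z') {    /* ASCII upper folds to lower */
        out[0] = cp + 0x20;
        return 1;
    }
    out[0] = cp;
    return 1;
}

/* Fold a sequence into dst; returns the folded length, or -1 if the
 * result does not fit into FOLD_CAP code points. */
static int fold_seq(const uint32_t *src, size_t n, uint32_t dst[FOLD_CAP])
{
    size_t i, j, k, len = 0;
    uint32_t tmp[2];
    assert(src != NULL && dst != NULL);
    for (i = 0; i < n; i++) {
        k = fold_cp(src[i], tmp);
        if (len + k > FOLD_CAP) {
            return -1;
        }
        for (j = 0; j < k; j++) {
            dst[len++] = tmp[j];
        }
    }
    return (int)len;
}

/* Caseless match: two sequences are equal if their foldings are equal. */
static int caseless_equal(const uint32_t *a, size_t na,
                          const uint32_t *b, size_t nb)
{
    uint32_t fa[FOLD_CAP], fb[FOLD_CAP];
    int la = fold_seq(a, na, fa);
    int lb = fold_seq(b, nb, fb);
    int i;
    if (la < 0 || lb < 0 || la != lb) {
        return 0;
    }
    for (i = 0; i < la; i++) {
        if (fa[i] != fb[i]) {
            return 0;
        }
    }
    return 1;
}
```

Under this sketch "ß" and "SS" compare as caseless-equal because both fold to "ss". Folding answers "do these two strings match?", whereas the test script asks for a conversion whose result is the uppercase string "SS".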

[2015-09-30 00:11 UTC] cl at exomail dot to

To summarize:

* The test script expects mbstring to do case *mapping* as defined in
  Section 5.18 of http://www.unicode.org/versions/Unicode8.0.0/ch05.pdf
  
* php-src/ext/mbstring/ucgendat/ucgendat.c generates unicode_data.h
  In unicode_data.h the data for 0x00df (ß) is not there. 
  Does ucgendat.c use SpecialCasing.txt?

* FAQ 1 in http://unicode.org/faq/casemap_charprop.html: 
  "Is all of the Unicode case mapping information in UnicodeData.txt?"
  "No." Use UnicodeData.txt *and* SpecialCasing.txt!
  And Unicode-Standard Section "4.2 Case": 
  "The single-character mappings in UnicodeData.txt are insufficient for languages such as German."

* The data structure 
     static const unsigned int _uccase_map[] = {
  in php's unicode_data.h (IMHO) assumes a one-to-one mapping 
  /* Starting indexes of the case tables
   * UpperIndex = 0
   * LowerIndex = _uccase_len[0]
   * TitleIndex = LowerIndex + _uccase_len[1] */

That's the source of the problem.
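
As a rough illustration of the SpecialCasing.txt point in the summary above, the following C sketch reads the uppercase column of one data line. It assumes the documented field order of that file (code; lower; title; upper; optional condition list; comment). The function name parse_special_upper is hypothetical, and nothing here is taken from ucgendat.c; whether and how that generator reads SpecialCasing.txt is exactly the open question raised above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Parse one SpecialCasing.txt data line.
 * Stores the source code point in *code and up to three uppercase
 * code points in upper[]. Returns the number of uppercase code points,
 * or -1 if the line is malformed. Conditions and comments are ignored. */
static int parse_special_upper(const char *line, uint32_t *code,
                               uint32_t upper[3])
{
    const char *p = line;
    char *end;
    unsigned long v;
    int field, n = 0;

    assert(line != NULL && code != NULL && upper != NULL);

    /* Field 0: the code point itself, in hexadecimal. */
    v = strtoul(p, &end, 16);
    if (end == p) {
        return -1;
    }
    *code = (uint32_t)v;
    p = end;

    /* Skip the separators after the code, lower and title fields. */
    for (field = 0; field < 3; field++) {
        p = strchr(p, ';');
        if (p == NULL) {
            return -1;
        }
        p++;
    }

    /* Field 3: space-separated hex code points, terminated by ';'. */
    while (n < 3) {
        v = strtoul(p, &end, 16);
        if (end == p) {
            break;  /* reached ';' or another non-hex character */
        }
        upper[n++] = (uint32_t)v;
        p = end;
    }
    return n > 0 ? n : -1;
}
```

For the line "00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S" this returns 2 and fills upper with 0x0053 0x0053. A table built from such entries needs a variable-length value per code point, which a flat one-to-one array cannot hold.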


What other information is needed to reach consensus that this is a bug? [A new bug report with a new number, really?]

[2015-09-30 00:54 UTC] fsb at thefsb dot org

That's a great summary of the mapping problem, cl at exomail dot to.

The title of this bug is "ext/mbstring/unicode_data.h needs update" and laruence@php.net has since updated it to UCD 8.0, so I think single-to-multi mapping is a separate bug report / feature request.

[2015-09-30 14:01 UTC] cl at exomail dot to

https://bugs.php.net/bug.php?id=70609

[2015-09-30 18:02 UTC] fsb at thefsb dot org

(Thanks for the new bug report, cl at exomail dot to.)

Before closing this one, I think it might be wise for mbstring to use Unicode 7 in PHP 7.0 rather than Unicode 8.

First: PCRE is stuck on Unicode 7 and I think it's going to stay that way. PCRE2 is on Unicode 8 but there's no sign of PHP adopting it.

Second: intl ext is using ICU 55.1 which is Unicode 7. ICU 56-rc is available but it seems likely 7.0 will go to GA with 55.1.

So it might make sense for PHP 7.0 to be consistent and use Unicode 7 across the board.

[2017-07-23 21:26 UTC] nikic@php.net

-Status: Assigned +Status: Closed

[2017-07-23 21:26 UTC] nikic@php.net

Closing here as unicode data has been updated and bug #70609 deals with the full case mapping. I've also updated the data to Unicode 10.0 in PHP 7.2.
