Bug #52981 Unicode data used by mbstring need to be updated
Submitted: 2010-10-04 00:39 UTC
Modified: 2010-10-05 03:55 UTC
From: mormegil at centrum dot cz
Assigned: cataphract
Status: Closed
Package: Unicode Engine related
PHP Version: 5.3.3
OS:
Private report: No
CVE-ID: None
 [2010-10-04 00:39 UTC] mormegil at centrum dot cz
Description:
------------
The mbstring functions appear to use outdated Unicode data, so they cannot handle characters added to Unicode in recent versions. For example, mb_strtoupper leaves newer lower-case letters unchanged because its casing data does not include them.

AFAICT (and I am no PHP expert, mind you), the fix is to update ext/mbstring/unicode_data.h, possibly just by regenerating the file, which says “It was generated by a modified version of ucgendat, part of the ucdata-2.5 package”.

Test script:
---------------
<?php
// Print the bytes of the upper-cased string as space-separated hex values.
function test($str)
{
	$upper = mb_strtoupper($str, 'UTF-8');
	$len = strlen($upper);
	for ($i = 0; $i < $len; ++$i) {
		echo dechex(ord($upper[$i])) . ' ';
	}
	echo "\n";
}

// OK
test("\xF0\x90\x90\xB8");// U+10438 DESERET SMALL LETTER H (added in 3.1.0, March 2001)
// not OK
test("\xE2\xB0\xB0");	// U+2C30 GLAGOLITIC SMALL LETTER AZU (added in 4.1.0, March 2005)
test("\xD4\xA5");		// U+0525 CYRILLIC SMALL LETTER PE WITH DESCENDER (added in 5.2.0, October 2009)


Expected result:
----------------
f0 90 90 90
e2 b0 80
d4 a4

Actual result:
--------------
f0 90 90 90
e2 b0 b0
d4 a5

History

 [2010-10-05 03:54 UTC] cataphract@php.net
Automatic comment from SVN on behalf of cataphract
Revision: http://svn.php.net/viewvc/?view=revision&revision=304056
Log: - Fixed bug #52981 (Unicode casing table was out-of-date).
  Updated with UnicodeData-6.0.0d7.txt and included the
  source of the generator program with the distribution.
#The replaced tables, generated circa 2002, seem to reflect
#Unicode 3.2. I was unable to generate the same property
#offsets with Unicode 3.2 data, but all the tests I made
#indicate php_unicode_is_prop() is returning the correct
#values. The replaced file merely says it used a "modified
#version" of ucgendat, which is not very helpful. The results
#I got were not significantly different, only slightly higher
#offsets at two properties, which were carried over to the
#subsequent properties.
#I was, however, able to replicate precisely the casing table.
#The extent of the "modifications" besides omitting most of
#the tables, a slightly different layout and the casing table
#offsets having been multiplied by 3 is unclear.
#The test suite showed no regressions; however, it's very poor
#in testing the modified portion of the extension.
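
The remark about casing table offsets being multiplied by 3 would fit a table stored as a flat array of triples. The following is only a sketch under that assumption, not the extension's actual code: each entry is taken to be a lower-case code point followed by two mappings, sorted by code point, and searched with a binary search that steps in units of three. The names `$caseMap` and `case_lookup_sketch` are hypothetical, and the three entries are just the mappings from the test script in this report.

```php
<?php
// Hypothetical flat table of triples: code point, upper-case, title-case.
// Sorted by the first element of each triple. A real generated table
// would be far larger; these are only the characters from the report.
$caseMap = array(
    0x0525,  0x0524,  0x0524,   // CYRILLIC SMALL LETTER PE WITH DESCENDER
    0x2C30,  0x2C00,  0x2C00,   // GLAGOLITIC SMALL LETTER AZU
    0x10438, 0x10410, 0x10410,  // DESERET SMALL LETTER H
);

// Binary search over the triples. $field picks the mapping column
// (1 = upper-case, 2 = title-case in this assumed layout).
function case_lookup_sketch($map, $code, $field)
{
    $l = 0;
    $r = count($map) - 3;        // offset of the last triple
    while ($l <= $r) {
        $m = ($l + $r) >> 1;
        $m -= $m % 3;            // snap to the start of a triple
        if ($code > $map[$m]) {
            $l = $m + 3;
        } elseif ($code < $map[$m]) {
            $r = $m - 3;
        } else {
            return $map[$m + $field];
        }
    }
    return $code;                // unknown code points pass through unchanged
}

printf("%X\n", case_lookup_sketch($caseMap, 0x2C30, 1)); // prints 2C00
printf("%X\n", case_lookup_sketch($caseMap, 0x0041, 1)); // prints 41
```

The pass-through on a miss matches the behaviour shown under "Actual result": a code point missing from the table comes back unchanged rather than raising an error.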
 [2010-10-05 03:55 UTC] cataphract@php.net
-Status: Open
+Status: Closed
-Assigned To:
+Assigned To: cataphract
 [2010-10-05 03:55 UTC] cataphract@php.net
Fixed for PHP 5.3 and trunk.

Used the tables from the nearly final Unicode 6.0.
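
To check whether a particular build already has the updated tables, something along these lines could be used. It is a sketch that reuses the three characters from the test script above and reports results as code points instead of raw bytes. The helper names `cp_to_utf8`, `utf8_to_cp` and `check_casing` are made up for this example, and the mbstring extension is assumed to be loaded.

```php
<?php
// Lower-case code point => expected upper-case code point,
// taken from the test script in this report.
$cases = array(
    0x10438 => 0x10410, // DESERET SMALL LETTER H
    0x2C30  => 0x2C00,  // GLAGOLITIC SMALL LETTER AZU
    0x0525  => 0x0524,  // CYRILLIC SMALL LETTER PE WITH DESCENDER
);

// Encode a single code point as UTF-8 by way of big-endian UCS-4.
function cp_to_utf8($cp)
{
    return mb_convert_encoding(pack('N', $cp), 'UTF-8', 'UCS-4BE');
}

// Decode the first character of a UTF-8 string to its code point.
function utf8_to_cp($s)
{
    $u = unpack('N', mb_convert_encoding($s, 'UCS-4BE', 'UTF-8'));
    return $u[1];
}

// Return lower => actual result for every mapping mb_strtoupper gets wrong.
function check_casing($cases)
{
    $failures = array();
    foreach ($cases as $lower => $upper) {
        $got = utf8_to_cp(mb_strtoupper(cp_to_utf8($lower), 'UTF-8'));
        if ($got !== $upper) {
            $failures[$lower] = $got;
        }
    }
    return $failures;
}

foreach (check_casing($cases) as $lower => $got) {
    printf("U+%04X was mapped to U+%04X, expected U+%04X\n",
        $lower, $got, $cases[$lower]);
}
```

On a build with the fix the script should print nothing. On an affected build it would list the characters that come back unchanged.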
 
Last updated: Thu Nov 21 11:01:29 2024 UTC