Bug #52981 Unicode data used by mbstring need to be updated
Submitted: 2010-10-04 00:39 UTC Modified: 2010-10-05 03:55 UTC
From: mormegil at centrum dot cz Assigned: cataphract (profile)
Status: Closed Package: Unicode Engine related
PHP Version: 5.3.3 OS:
Private report: No CVE-ID: None
 [2010-10-04 00:39 UTC] mormegil at centrum dot cz
Description:
------------
The mbstring functions seem to use outdated Unicode data, so they cannot handle characters added to Unicode in recent versions (e.g. mb_strtoupper leaves newer lower-case letters unchanged because it does not know them).

AFAICT (and I am no PHP expert, mind you), the fix is to update ext/mbstring/unicode_data.h, possibly just by regenerating the file, since it says “It was generated by a modified version of ucgendat, part of the ucdata-2.5 package”.

Test script:
---------------
<?php
// Print the UTF-8 bytes of the upper-cased string as hex.
function test($str)
{
	$upper = mb_strtoupper($str, 'UTF-8');
	$len = strlen($upper);
	for ($i = 0; $i < $len; ++$i) {
		echo dechex(ord($upper[$i])) . ' ';
	}
	echo "\n";
}

// OK
test("\xF0\x90\x90\xB8");// U+10438 DESERET SMALL LETTER H (added in 3.1.0, March 2001)
// not OK
test("\xE2\xB0\xB0");	// U+2C30 GLAGOLITIC SMALL LETTER AZU (added in 4.1.0, March 2005)
test("\xD4\xA5");		// U+0525 CYRILLIC SMALL LETTER PE WITH DESCENDER (added in 5.2.0, October 2009)


Expected result:
----------------
f0 90 90 90
e2 b0 80
d4 a4

Actual result:
--------------
f0 90 90 90
e2 b0 b0
d4 a5
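
A code-point-based variant of the test script may make the mismatch easier to read. This is a minimal sketch, not part of the original report: `cp_to_utf8()` and `hex_bytes()` are hypothetical helper names, and the lower/upper pairs are the three code points from the test script. It sticks to PHP 5.3 syntax so that it should also run on the affected version.

```php
<?php
// Encode a Unicode code point as a UTF-8 byte string (pure PHP, no extensions).
// Hypothetical helper; assumes $cp is a valid scalar value up to U+10FFFF.
function cp_to_utf8($cp)
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Space-separated lower-case hex dump of a byte string (hypothetical helper).
function hex_bytes($s)
{
    return implode(' ', str_split(bin2hex($s), 2));
}

// Lower-case => expected upper-case code points, taken from the report.
$pairs = array(
    0x10438 => 0x10410, // DESERET SMALL LETTER H
    0x2C30  => 0x2C00,  // GLAGOLITIC SMALL LETTER AZU
    0x0525  => 0x0524,  // CYRILLIC SMALL LETTER PE WITH DESCENDER
);

// Only run the comparison when mbstring is actually loaded.
if (function_exists('mb_strtoupper')) {
    foreach ($pairs as $lower => $upper) {
        $got = mb_strtoupper(cp_to_utf8($lower), 'UTF-8');
        $status = ($got === cp_to_utf8($upper)) ? 'OK' : 'not OK';
        printf("U+%04X: %s (%s)\n", $lower, hex_bytes($got), $status);
    }
}
```

Each line of output shows the bytes mb_strtoupper returned and whether they match the expected upper-case code point, so further pairs can be added to `$pairs` without hand-encoding UTF-8.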

History

 [2010-10-05 03:54 UTC] cataphract@php.net
Automatic comment from SVN on behalf of cataphract
Revision: http://svn.php.net/viewvc/?view=revision&revision=304056
Log: - Fixed bug #52981 (Unicode casing table was out-of-date).
  Updated with UnicodeData-6.0.0d7.txt and included the
  source of the generator program with the distribution.
#The replaced tables, generated circa 2002, seem to reflect
#Unicode 3.2. I was unable to generate the same property
#offsets with Unicode 3.2 data, but all the tests I made
#indicate php_unicode_is_prop() is returning the correct
#values. The replaced file merely says it used a "modified
#version" of ucgendat, which is not very helpful. The results
#I got were not significantly different, only slightly higher
#offsets at two properties, which were carried over to the
#subsequent properties.
#I was, however, able to replicate precisely the casing table.
#The extent of the "modifications" besides omitting most of
#the tables, a slightly different layout and the casing table
#offsets having been multiplied by 3 is unclear.
#The test suite showed no regressions; however, it's very poor
#in testing the modified portion of the extension.
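
The remark about casing-table offsets being multiplied by 3 is consistent with a flat array of three-slot entries, where index arithmetic has to step in multiples of three. The following is only an illustrative sketch under that assumption, not the code in ext/mbstring: `$case_map`, `case_lookup_sketch()`, and the four-entry table are hypothetical. It also shows why a code point missing from such a table comes back unchanged, which matches the actual result in the report.

```php
<?php
// HYPOTHETICAL miniature casing table: a flat array of
// (code, upper, title) triples sorted by code. Not PHP's real data.
$case_map = array(
    0x0061,  0x0041,  0x0041,  // LATIN SMALL LETTER A
    0x0525,  0x0524,  0x0524,  // CYRILLIC SMALL LETTER PE WITH DESCENDER
    0x2C30,  0x2C00,  0x2C00,  // GLAGOLITIC SMALL LETTER AZU
    0x10438, 0x10410, 0x10410, // DESERET SMALL LETTER H
);

// Binary search over the triples. $field selects the slot to return:
// 1 = upper-case mapping, 2 = title-case mapping.
function case_lookup_sketch(array $map, $code, $field)
{
    $l = 0;
    $r = count($map) - 3; // index of the first slot of the last triple
    while ($l <= $r) {
        $m = ($l + $r) >> 1;
        $m -= $m % 3; // snap the midpoint to the start of a triple
        if ($code > $map[$m]) {
            $l = $m + 3;
        } elseif ($code < $map[$m]) {
            $r = $m - 3;
        } else {
            return $map[$m + $field];
        }
    }
    return $code; // not in the table: returned unchanged
}

printf("U+%04X\n", case_lookup_sketch($case_map, 0x2C30, 1));
printf("U+%04X\n", case_lookup_sketch($case_map, 0x0062, 1));
```

With this layout every boundary is a multiple of three, so a generator that emits section offsets for such a table would naturally store them pre-multiplied.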
 [2010-10-05 03:55 UTC] cataphract@php.net
-Status: Open
+Status: Closed
-Assigned To:
+Assigned To: cataphract
 [2010-10-05 03:55 UTC] cataphract@php.net
Fixed for PHP 5.3 and trunk.

Used the tables from the nearly final Unicode 6.0.
 