Bug #48322 strcoll() does not work with UTF-8 strings on Mac OS X
Submitted: 2009-05-18 22:37 UTC Modified: 2009-05-19 14:54 UTC
From: netspy at me dot com Assigned:
Status: Wont fix Package: *Unicode Issues
PHP Version: 5.2.9 OS: Mac OS X
Private report: No CVE-ID: None
 [2009-05-18 22:37 UTC] netspy at me dot com
Description:
------------
strcoll() does not sort UTF-8 strings correctly on Mac OS X.

Reproduce code:
---------------
$locale = 'de_DE.UTF-8'; 
$string = "abcdefghijklmnopqrstuvwxyzäöüß";

$array = array(); 

for ($i=0; $i<mb_strlen($string, 'UTF-8'); $i++) { 
    $array[]=mb_substr($string, $i, 1, 'UTF-8'); 
} 

$oldLocale = setlocale(LC_COLLATE, "0"); 

print("\nOld: $oldLocale New: "); 
print(setlocale(LC_COLLATE, $locale)); 
usort($array, 'strcoll'); 
setlocale(LC_COLLATE, $oldLocale); 
print("\n" . implode('', $array) . "\n"); 

Expected result:
----------------
Old: C New: de_DE.UTF-8
aäbcdefghijklmnoöpqrsßtuüvwxyz

Actual result:
--------------
Old: C New: de_DE.UTF-8
abcdefghijklmnopqrstuvwxyz????

 [2009-05-19 10:50 UTC] jani@php.net
strcoll() doesn't work with UTF-8 strings on any system before PHP 6. You can always use the intl extension from PECL while waiting for proper Unicode support: http://pecl.php.net/intl 

Using the Collator class (http://php.net/collator) you can sort correctly in any locale.
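
For illustration, a minimal sketch of the Collator approach suggested above, assuming the intl extension is loaded. The helper name sort_chars_with_collator() is hypothetical, and the non-ASCII characters are written as explicit UTF-8 bytes so the example does not depend on how the file is saved:

```php
<?php
// Sketch only: sorts the characters of a UTF-8 string with ICU's Collator
// instead of the system strcoll(). Requires the intl extension.

// Hypothetical helper, not part of PHP or intl.
function sort_chars_with_collator($string, $locale)
{
    // Split into single UTF-8 characters (the /u modifier enables UTF-8 mode).
    $chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);

    // Collator::sort() sorts the array in place using the locale's rules.
    $collator = new Collator($locale);
    $collator->sort($chars, Collator::SORT_STRING);

    return implode('', $chars);
}

// a-z followed by a-umlaut, o-umlaut, u-umlaut and sharp s as UTF-8 bytes.
$input = "abcdefghijklmnopqrstuvwxyz\xC3\xA4\xC3\xB6\xC3\xBC\xC3\x9F";
$sorted = sort_chars_with_collator($input, 'de_DE');
print($sorted . "\n");
```

With the de_DE rules each umlaut should end up next to its base letter and the sharp s after s, regardless of what the operating system's strcoll() does.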
 [2009-05-19 12:35 UTC] netspy at me dot com
On Linux strcoll works fine; I get the wrong order only on Mac OS X 
(BSD). I also tested it with an ISO 8859-1 string and the locale 
de_DE.ISO8859-1. Same result: correct on Linux, wrong on Mac OS X.

So I think it's not a Unicode issue!

Here is another test code:

$string_utf = "abcdefghijklmnopqrstuvwxyzäöüß";
$string_iso = utf8_decode($string_utf);

$array_utf = array(); $array_iso = array();

for ($i=0; $i<mb_strlen($string_utf, 'UTF-8'); $i++) {
    $array_utf[]=mb_substr($string_utf, $i, 1, 'UTF-8');
    $array_iso[]=substr($string_iso, $i, 1);
}

print("\nLocale: " . setlocale(LC_COLLATE, 'de_DE.UTF-8'));
usort($array_utf, 'strcoll');
print("\n" . implode('', $array_utf) . "\n");

print("\nLocale: " . setlocale(LC_COLLATE, 'de_DE.ISO8859-1'));
usort($array_iso, 'strcoll');
print("\n" . utf8_encode(implode('', $array_iso)) . "\n");


The result on Mac OS X:

Locale: de_DE.UTF-8
abcdefghijklmnopqrstuvwxyz????

Locale: de_DE.ISO8859-1
abcdefghijklmnopqrstuvwxyz????

And the Linux result:

Locale: de_DE.UTF-8
aäbcdefghijklmnoöpqrsßtuüvwxyz

Locale: de_DE.ISO8859-1
aäbcdefghijklmnoöpqrsßtuüvwxyz
 [2009-05-19 12:58 UTC] jani@php.net
I get the wrong order on Linux. Did you mix up the results there? Anyway, this really is a problem in Unicode support. To get something that _really_ works, use the intl extension or wait for PHP 6. Won't fix.
 [2009-05-19 14:03 UTC] netspy at me dot com
What is your result on Linux? Did you save the test file with UTF-8 
encoding?

Because strcoll is basically a C function, I can't see why this is a PHP 
Unicode issue or why you closed the bug as Won't fix.
 [2009-05-19 14:54 UTC] rasmus@php.net
The code for this function is just:

    RETURN_LONG(strcoll((const char *) Z_STRVAL_PP(s1),
                        (const char *) Z_STRVAL_PP(s2)));

We use the underlying system strcoll function. There is nothing for us to fix here. If your system's strcoll function is broken, you are out of luck. OS X has a long history of buggy C99 functions and it wouldn't surprise me if the strcoll function doesn't handle UTF-8 locales correctly. But that still isn't something we can fix, short of doing an OS-specific hack here, which we try to avoid.
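
Since PHP only passes both strings to the system function, one way to see what a given platform does is a small probe like the following. This is a sketch under assumptions, not part of PHP: the helper name strcoll_honors_locale() is hypothetical, and it only tests one pair of strings (a-umlaut against "b"), which a byte-wise comparison orders the opposite way from German collation.

```php
<?php
// Sketch only: checks whether the system strcoll() applies locale rules
// for one sample pair. Returns null when the locale is not installed.

// Hypothetical helper, not part of PHP.
function strcoll_honors_locale($locale)
{
    // Passing "0" queries the current setting without changing it.
    $old = setlocale(LC_COLLATE, "0");

    if (setlocale(LC_COLLATE, $locale) === false) {
        return null; // locale not available on this system
    }

    // "\xC3\xA4" is a-umlaut in UTF-8. Byte-wise it sorts after "b"
    // (0xC3 > 0x62); under German collation it sorts before "b".
    $honored = strcoll("\xC3\xA4", "b") < 0;

    setlocale(LC_COLLATE, $old); // restore the previous collation locale
    return $honored;
}

$result = strcoll_honors_locale('de_DE.UTF-8');
if ($result === null) {
    print("Locale de_DE.UTF-8 is not installed\n");
} else {
    print($result ? "strcoll() follows the locale\n"
                  : "strcoll() behaves like a byte comparison\n");
}
```

A false result points at the C library rather than at PHP, which matches the behaviour described in this report for Mac OS X.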