[2009-05-18 22:37 UTC] netspy at me dot com
Description:
------------
strcoll() does not sort UTF-8 strings correctly on Mac OS X.
Reproduce code:
---------------
$locale = 'de_DE.UTF-8';
$string = "abcdefghijklmnopqrstuvwxyzäöüß";
$array = array();
for ($i=0; $i<mb_strlen($string, 'UTF-8'); $i++) {
    $array[]=mb_substr($string, $i, 1, 'UTF-8');
}
$oldLocale = setlocale(LC_COLLATE, "0");
print("\nOld: $oldLocale New: ");
print(setlocale(LC_COLLATE, $locale));
usort($array, 'strcoll');
setlocale(LC_COLLATE, $oldLocale);
print("\n" . implode('', $array) . "\n");
Expected result:
----------------
Old: C New: de_DE.UTF-8
aäbcdefghijklmnoöpqrsßtuüvwxyz
Actual result:
--------------
Old: C New: de_DE.UTF-8
abcdefghijklmnopqrstuvwxyz????
On Linux strcoll() works fine; I only get the wrong order on Mac OS X (BSD). I also tested it with an ISO 8859-1 string and the locale de_DE.ISO8859-1, with the same result: correct on Linux, wrong on Mac OS X. So I think it's not a Unicode issue! Here is another test script:

$string_utf = "abcdefghijklmnopqrstuvwxyzäöüß";
$string_iso = utf8_decode($string_utf);
$array_utf = array();
$array_iso = array();
for ($i=0; $i<mb_strlen($string_utf, 'UTF-8'); $i++) {
    $array_utf[]=mb_substr($string_utf, $i, 1, 'UTF-8');
    $array_iso[]=substr($string_iso, $i, 1);
}
print("\nLocale: " . setlocale(LC_COLLATE, 'de_DE.UTF-8'));
usort($array_utf, 'strcoll');
print("\n" . implode('', $array_utf) . "\n");
print("\nLocale: " . setlocale(LC_COLLATE, 'de_DE.ISO8859-1'));
usort($array_iso, 'strcoll');
print("\n" . utf8_encode(implode('', $array_iso)) . "\n");

The result on Mac OS X:

Locale: de_DE.UTF-8
abcdefghijklmnopqrstuvwxyz????
Locale: de_DE.ISO8859-1
abcdefghijklmnopqrstuvwxyz????

And the Linux result:

Locale: de_DE.UTF-8
aäbcdefghijklmnoöpqrsßtuüvwxyz
Locale: de_DE.ISO8859-1
aäbcdefghijklmnoöpqrsßtuüvwxyz

The code for this function is just:

RETURN_LONG(strcoll((const char *) Z_STRVAL_PP(s1), (const char *) Z_STRVAL_PP(s2)));

We use the underlying system strcoll function. There is nothing for us to fix here. If your system's strcoll function is broken, you are out of luck. OS X has a long history of buggy C99 functions, and it wouldn't surprise me if its strcoll function doesn't handle UTF-8 locales correctly. But that still isn't something we can fix, short of adding an OS-specific hack here, which we try to avoid.
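
Since PHP's strcoll() simply forwards to the C library, one way to check whether the ordering comes from the platform rather than from PHP is to repeat the sort in plain C. The following is only a minimal sketch under stated assumptions, not code from PHP: the helper names compare_strcoll and sort_in_locale are made up for illustration, and a locale name such as de_DE.UTF-8 only works if that locale is installed on the system.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator (hypothetical helper): compares two C strings with the
   C library's strcoll(), which follows the current LC_COLLATE locale. */
static int compare_strcoll(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcoll(*sa, *sb);
}

/* Hypothetical helper: sorts `items` in place using the collation rules of
   the named locale. Returns 0 on success, or -1 if setlocale() rejects the
   locale name, in which case the array is left untouched. The previous
   LC_COLLATE setting is restored before returning. */
int sort_in_locale(const char *locale, const char **items, size_t count)
{
    char saved[256] = "C";
    const char *current = setlocale(LC_COLLATE, NULL);

    /* Copy the current name: later setlocale() calls may overwrite it. */
    if (current != NULL) {
        strncpy(saved, current, sizeof saved - 1);
        saved[sizeof saved - 1] = '\0';
    }

    if (setlocale(LC_COLLATE, locale) == NULL)
        return -1;

    qsort(items, count, sizeof items[0], compare_strcoll);
    setlocale(LC_COLLATE, saved);
    return 0;
}
```

Calling sort_in_locale("de_DE.UTF-8", letters, count) on an array holding the single-character strings from the report, and printing the array afterwards on both Linux and Mac OS X, should show whether the difference appears without PHP involved. If it does, that would support the view that the behaviour belongs to the system's strcoll.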