| Bug #19257 | strtolower & strtoupper does not work for UTF-8 strings | ||||
|---|---|---|---|---|---|
| Submitted: | 5 Sep 2002 3:46pm UTC | Modified: | 7 Oct 2002 4:36pm UTC | ||
| From: | gamid at isayev dot net | Assigned to: | |||
| Status: | Closed | Category: | Strings related | ||
| Version: | 4.2.2 | OS: | Linux | ||
[5 Sep 2002 4:02pm UTC] sniper@php.net
Please try using this CVS snapshot:
http://snaps.php.net/php4-latest.tar.gz
For Windows:
http://snaps.php.net/win32/php4-win32-latest.zip
I can't reproduce this with PHP 4.3.0-dev.
[5 Sep 2002 5:02pm UTC] gamid at isayev dot net
I backported strtolower() & strtoupper() from the latest string.c
(rev. 1.290). The functions still do not work:
loc = 'UTF-8'
str = 'Test'
strU = 'TEST'
strL = 'test'
Since I am not sure that you entered the UTF-8 symbols correctly, here's
a modified version of the code that creates the desired test string:
<?
$str = "Test".utf8_encode("\xFC");
$loc = "UTF-8";
putenv("LANG=$loc");
$loc = setlocale(LC_ALL, $loc);
$strU = strtoupper($str);
$strL = strtolower($str);
?>
<PRE>
loc = '<? echo $loc; ?>'
str = '<? echo $str; ?>'
strU = '<? echo $strU; ?>'
strL = '<? echo $strL; ?>'
</PRE>
[5 Sep 2002 5:09pm UTC] sniper@php.net
Output:
str = 'Testü'
strU = 'TESTü'
strL = 'test'
Don't try to patch..you're not doing it right anyway. Just pull the snapshot and try with it.
[5 Sep 2002 5:24pm UTC] gamid at isayev dot net
> Don't try to patch..you're not doing it right anyway.
:)
> Just pull the snapshot and try with it.
Ok
PHP: 20020307
PHP Extension: 20020429
Zend Extension: 20020903
Output:
loc = 'UTF-8'
str = 'Test'
strU = 'TEST'
strL = 'test'
[6 Sep 2002 1:27pm UTC] gamid at isayev dot net
The functions do not work in PHP 4.2.3 or in the latest snapshot.
[6 Sep 2002 1:29pm UTC] gamid at isayev dot net
BTW, here is my configuration:
./configure \
  --with-apxs=/usr/sbin/apxs \
  --enable-track-vars \
  --enable-safe-mode \
  --with-config-file-path=/etc/httpd \
  --with-zlib \
  --enable-magic-quotes \
  --with-regex=system \
  --without-mysql \
  --without-xml \
  --without-gd \
  --with-pgsql=shared \
  --with-imap \
  --with-iconv \
  --enable-mbstring \
  --with-xml \
  --with-kerberos
php.ini has 'default_charset=utf-8'.
[7 Sep 2002 2:20pm UTC] sniper@php.net
Exactly what distribution do you have? How are LANG/LANGUAGE/LC_ALL etc. environment variables set in your system (before starting Apache)..? And please, try with the stock php.ini-dist too.
[9 Sep 2002 8:39am UTC] gamid at isayev dot net
> Exactly what distribution do you have?
Mandrake 8.1 with the following RPMs installed:
locales-2.3.1.2-4mdk
locales-en-2.3.1.2-4mdk
locales-ru-2.3.1.2-4mdk
locales-de-2.3.1.2-4mdk
> How are LANG/LANGUAGE/LC_ALL etc. environment variables
> set in your system (before starting Apache)..?
All environment variables are set to 'UTF-8'.
> And please, try with the stock php.ini-dist too.
Same result.
BTW, in php.ini-dist there should be a semicolon instead of a colon on line 98:
": is doing." -> "; is doing."
Could you tell me your settings so that I can try them out?
[9 Sep 2002 10:45am UTC] sniper@php.net
I didn't know UTF-8 is a locale.. I tried setting LANG/LC_ALL to that and it indeed didn't work. When I set those to "en_US" it works just fine.
[9 Sep 2002 11:28am UTC] gamid at isayev dot net
> I tried setting LANG/LC_ALL to that and it indeed
> didn't work. When I set those to "en_US" it works just fine.
What do you mean by "works just fine"?
Did it convert 0xC39C ('Ü' in UTF-8 encoding) into 0xC3BC ('ü' in UTF-8
encoding)? Or 0xD0AF (Russian capital "ya" in UTF-8 encoding) into
0xD18F (Russian lowercase "ya" in UTF-8 encoding)?
[9 Sep 2002 5:06pm UTC] sniper@php.net
So you didn't try it..? I only tried your test script and got the expected result. Whatever the characters are..I've no idea of them anyway..
btw. AFAIK, setting LANG / LC_ALL to UTF-8 is not the correct way to do it..
http://melkor.dnp.fmph.uniba.sk/~garabik/debian-utf8/howto.html
According to that HOWTO, it should be something like ru_RU.UTF-8 (and only if you really have UTF-8 locales)
I'm bogusing this since it really isn't anything PHP can affect..
[10 Sep 2002 8:44am UTC] gamid at isayev dot net
> So you didn't try it..?
Yes, I set LC_ALL/LANG to 'en_US' and tried it.
> I only tried your test script and got the expected result.
> Whatever the characters are.. I've no idea of them anyway..
I think you're confused by looking at the result of the test script with the encoding set to 'ISO-8859-x' instead of 'UTF-8'. In that case it looks as if some characters changed to lower/upper case. BUT they are not UTF-8 lower/upper case characters:
1) 0xC39C changed to 0xE39C, should be 0xC3BC
2) 0xD0AF changed to 0xF0AF, should be 0xD18F
As a result we get garbage instead of a UTF-8 string. If you really want to test this issue, you should set 'default_charset=utf-8' in php.ini or set the encoding to 'UTF-8' in your browser.
> btw. AFAIK, setting LANG / LC_ALL to UTF-8 is not correct
> way to do it..
> According to that HOWTO, it should be something like
> ru_RU.UTF-8 (and only if you really have UTF-8 locales)
I tried en_US.UTF-8, de_DE.UTF-8, ru_RU.UTF-8 - no luck.
> I'm bogusing this since it really isn't anything PHP can
> affect..
So, is there no way in PHP to convert a UTF-8 string to lower/upper case?
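The byte values quoted in this comment can be reproduced with a small C sketch. It does not call the C library's tolower(); instead it uses a hand-written helper, latin1_tolower(), which is a hypothetical stand-in for what a byte-wise lowercase does under an ISO-8859-1 locale. Both function names are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: lowercase one byte the way an ISO-8859-1
 * locale would, treating every byte as a complete character. */
static unsigned char latin1_tolower(unsigned char c)
{
    if (c >= 'A' && c <= 'Z')
        return (unsigned char)(c + 0x20);
    /* 0xC0..0xDE are the Latin-1 capitals, except 0xD7 (multiplication sign). */
    if (c >= 0xC0 && c <= 0xDE && c != 0xD7)
        return (unsigned char)(c + 0x20);
    return c;
}

/* Lowercase a buffer one byte at a time, ignoring multibyte sequences. */
void bytewise_lower(unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    for (size_t i = 0; i < len; i++)
        buf[i] = latin1_tolower(buf[i]);
}
```

Applied to the UTF-8 bytes 0xC3 0x9C, only the lead byte changes (0xC3 becomes 0xE3), which gives 0xE3 0x9C rather than the expected 0xC3 0xBC. The same happens with 0xD0 0xAF, which becomes 0xF0 0xAF.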
[10 Sep 2002 8:52am UTC] wez@php.net
This is not a bug in PHP; it's down to whether your system
can support this and has the appropriate locales installed.
A quick and dirty example might look like this in C:
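#include <ctype.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>
int main(void)
{
    char buff[1024];
    /* Use the locale from the environment (LANG / LC_ALL). */
    setlocale(LC_ALL, "");
    while (fgets(buff, sizeof(buff), stdin)) {
        size_t i, l;
        l = strlen(buff);
        /* Convert byte by byte; toupper() needs an unsigned char value. */
        for (i = 0; i < l; i++)
            buff[i] = (char)toupper((unsigned char)buff[i]);
        /* fgets() keeps the newline, so do not add another one. */
        fputs(buff, stdout);
    }
    return 0;
}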
If that little program works, your system supports
this conversion. If it doesn't, then PHP doesn't
either.
[10 Sep 2002 8:54am UTC] wez@php.net
I forgot to add that you should feed your utf8 data to the input of that little program.
[10 Sep 2002 9:20am UTC] gamid at isayev dot net
As I understand it, toupper()/tolower() work only for single-byte
encodings. So the right way is to use the 'wide' versions of
toupper()/tolower(): towupper()/towlower().
Example:
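#include <stdio.h>
#include <wctype.h>
#include <locale.h>
int main(void) {
    /* setlocale() returns NULL if the requested locale is not available. */
    const char *loc = setlocale(LC_ALL, "UTF-8");
    printf("locale set to '%s'\n", loc ? loc : "(null)");
    /* %lc prints a wint_t as a multibyte character in the current locale. */
    printf("0x00DC C='%lc'\n", towlower(0x00DC));
    printf("0x042F C='%lc'\n", towlower(0x042F));
    return 0;
}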
And it works fine for UCS-2 (UTF-16).
In PHP I can convert UTF-8 to UTF-16 by using iconv().
But PHP has no 'wide' version of strtolower()/strtoupper().
So, what can I do?
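For reference, the wide-character round trip described in this comment might look like the following in C. This is a minimal sketch, not something taken from PHP: the function name mb_upper() is hypothetical, it assumes the POSIX behaviour of mbstowcs() and wcstombs() when the destination is NULL (they return the required length), and it relies on the caller having already selected a suitable locale with setlocale().

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Uppercase a multibyte string according to the current LC_CTYPE locale.
 * Returns 0 on success, or -1 if the input is not valid in the current
 * locale, memory runs out, or the output buffer is too small.
 */
int mb_upper(const char *in, char *out, size_t outsize)
{
    assert(in != NULL && out != NULL);

    /* Assumption (POSIX): a NULL destination yields the wide length. */
    size_t n = mbstowcs(NULL, in, 0);
    if (n == (size_t)-1)
        return -1;

    wchar_t *w = malloc((n + 1) * sizeof *w);
    if (w == NULL)
        return -1;
    mbstowcs(w, in, n + 1);

    /* Case-convert whole characters instead of single bytes. */
    for (size_t i = 0; i < n; i++)
        w[i] = (wchar_t)towupper((wint_t)w[i]);

    /* Bytes needed for the result, excluding the terminating NUL. */
    size_t need = wcstombs(NULL, w, 0);
    int rc = -1;
    if (need != (size_t)-1 && need < outsize) {
        wcstombs(out, w, outsize);
        rc = 0;
    }
    free(w);
    return rc;
}
```

Whether non-ASCII characters are converted depends entirely on the locale in effect, for example one selected with setlocale(LC_ALL, "ru_RU.UTF-8") on a system where that locale is installed. In the default "C" locale only ASCII letters are affected.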
[25 Sep 2002 8:23pm UTC] wez@php.net
I've added a new function to the mbstring extension in CVS. This function will be in PHP 4.3. I would appreciate your feedback. Try a snapshot from http://snaps.php.net/php4-latest.tar.gz dated after this message.
usage:
proto string mb_convert_case(string str, int mode [, string encoding]);
mode can be one of MB_CASE_UPPER, MB_CASE_LOWER or MB_CASE_TITLE.
encoding specifies the encoding of str; if omitted, the mbstring.internal_encoding value will be used.
The return value is str with the appropriate case folding applied.
The function works by internally converting the string into UCS-4 format and applying php_unicode_to(upper|lower|title) to each unicode character, and then converts the string back into the original encoding.
The code for your test case would look like this (and works for me):
<?
$str = "Test".utf8_encode("\xFC");
$strU = mb_convert_case($str, MB_CASE_UPPER, "utf-8");
$strL = mb_convert_case($str, MB_CASE_LOWER, "utf-8");
?>
<PRE>
str = '<? echo $str; ?>'
strU = '<? echo $strU; ?>'
strL = '<? echo $strL; ?>'
</PRE>
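The decode, map, re-encode idea described in this comment can be illustrated with a small C sketch. This is not the mbstring implementation: tiny_toupper() is a hypothetical mapping that covers only ASCII, the Latin-1 letters and basic Cyrillic, whereas a real implementation would rely on full Unicode case tables. Because every mapping in this table keeps the UTF-8 length unchanged, the sketch can work in place. Sequences longer than two bytes are passed through untouched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mapping: ASCII, Latin-1 and basic Cyrillic only. */
static uint32_t tiny_toupper(uint32_t cp)
{
    if (cp >= 'a' && cp <= 'z')
        return cp - 0x20;
    if (cp >= 0x00E0 && cp <= 0x00FE && cp != 0x00F7)
        return cp - 0x20;               /* U+00E0..U+00FE, skipping the division sign */
    if (cp >= 0x0430 && cp <= 0x044F)
        return cp - 0x20;               /* Cyrillic a..ya */
    if (cp >= 0x0450 && cp <= 0x045F)
        return cp - 0x50;               /* Cyrillic extensions U+0450..U+045F */
    return cp;
}

/* Uppercase a UTF-8 buffer in place, one code point at a time. */
void utf8_upper_inplace(unsigned char *s, size_t len)
{
    assert(s != NULL);
    size_t i = 0;
    while (i < len) {
        unsigned char b = s[i];
        if (b < 0x80) {
            /* One-byte sequence: plain ASCII. */
            s[i] = (unsigned char)tiny_toupper(b);
            i += 1;
        } else if (b >= 0xC2 && b <= 0xDF && i + 1 < len &&
                   (s[i + 1] & 0xC0) == 0x80) {
            /* Two-byte sequence: decode, map, re-encode. */
            uint32_t cp = ((uint32_t)(b & 0x1F) << 6) | (s[i + 1] & 0x3F);
            cp = tiny_toupper(cp);
            s[i]     = (unsigned char)(0xC0 | (cp >> 6));
            s[i + 1] = (unsigned char)(0x80 | (cp & 0x3F));
            i += 2;
        } else {
            /* Longer or malformed sequences are left unchanged. */
            i += 1;
        }
    }
}
```

With this sketch the bytes 0xC3 0xBC (U+00FC) become 0xC3 0x9C (U+00DC), and 0xD1 0x8F (U+044F) becomes 0xD0 0xAF (U+042F), matching the values discussed earlier in the thread.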
[7 Oct 2002 4:36pm UTC] gamid at isayev dot net
Works fine for German and Russian characters. Thanks!

Functions strtolower() & strtoupper() do not change UTF-8 strings. I tried Russian (0x042F, 0x044F) and German (0x00DC, 0x00FC) characters.
Example:
<?
$str = "testЯ";
$loc = "UTF-8";
putenv("LANG=$loc");
$loc = setlocale(LC_ALL, $loc);
$strU = strtoupper($str);
$strL = strtolower($str);
?>
<PRE>
loc = '<? echo $loc; ?>'
str = '<? echo $str; ?>'
strU = '<? echo $strU; ?>'
strL = '<? echo $strL; ?>'
</PRE>