Bug #19257 strtolower & strtoupper does not work for UTF-8 strings
Submitted: 2002-09-05 15:46 UTC Modified: 2002-10-07 16:36 UTC
From: gamid at isayev dot net Assigned:
Status: Closed Package: Strings related
PHP Version: 4.2.2 OS: Linux
Private report: No CVE-ID: None
 [2002-09-05 15:46 UTC] gamid at isayev dot net
The functions strtolower() and strtoupper() do not change UTF-8 strings. I tried Russian (0x042F, 0x044F) and German (0x00DC, 0x00FC) characters.
Example:

<?
$str = "testЯ";

$loc = "UTF-8";
putenv("LANG=$loc");
$loc = setlocale(LC_ALL, $loc);

$strU = strtoupper($str);
$strL = strtolower($str);
?>
<PRE>
loc  = '<? echo $loc;  ?>'
str  = '<? echo $str;  ?>'
strU = '<? echo $strU; ?>'
strL = '<? echo $strL; ?>'
</PRE>

 [2002-09-05 16:02 UTC] sniper@php.net
Please try using this CVS snapshot:

  http://snaps.php.net/php4-latest.tar.gz
 
For Windows:
 
  http://snaps.php.net/win32/php4-win32-latest.zip

I can't reproduce this with PHP 4.3.0-dev.

 [2002-09-05 17:02 UTC] gamid at isayev dot net
I backported strtolower() and strtoupper() from the latest string.c (rev. 1.290). The functions still do not work:
loc  = 'UTF-8'
str  = 'Test?'
strU = 'TEST?'
strL = 'test?'


Since I am not sure that you entered the UTF-8 symbols correctly, here's a modified version of the code that creates the desired test string:

<?
$str = "Test".utf8_encode("\xFC");

$loc = "UTF-8";
putenv("LANG=$loc");
$loc = setlocale(LC_ALL, $loc);

$strU = strtoupper($str);
$strL = strtolower($str);
?>
<PRE>
loc  = '<? echo $loc;  ?>'
str  = '<? echo $str;  ?>'
strU = '<? echo $strU; ?>'
strL = '<? echo $strL; ?>'
</PRE>
 [2002-09-05 17:09 UTC] sniper@php.net
Output: 

str  = 'Testü'
strU = 'TESTü'
strL = 'test??'

Don't try to patch..you're not doing it right anyway.
Just pull the snapshot and try with it.

 [2002-09-05 17:24 UTC] gamid at isayev dot net
> Don't try to patch..you're not doing it right anyway.
:)

> Just pull the snapshot and try with it.
Ok

PHP: 20020307
PHP Extension: 20020429
Zend Extension: 20020903

Output:
loc  = 'UTF-8'
str  = 'Test?'
strU = 'TEST?'
strL = 'test?'
 [2002-09-06 13:27 UTC] gamid at isayev dot net
The functions do not work in PHP 4.2.3 and latest snapshot.
 [2002-09-06 13:29 UTC] gamid at isayev dot net
BTW, here is my configuration:

./configure \
  --with-apxs=/usr/sbin/apxs \
  --enable-track-vars \
  --enable-safe-mode \
  --with-config-file-path=/etc/httpd \
  --with-zlib \
  --enable-magic-quotes \
  --with-regex=system \
  --without-mysql \
  --without-xml \
  --without-gd \
  --with-pgsql=shared \
  --with-imap \
  --with-iconv \
  --enable-mbstring \
  --with-xml \
  --with-kerberos

php.ini has 'default_charset=utf-8'.
 [2002-09-07 14:20 UTC] sniper@php.net
Exactly what distribution do you have? 
How are LANG/LANGUAGE/LC_ALL etc. environment variables
set in your system (before starting Apache)..?

And please, try with the stock php.ini-dist too.


 [2002-09-09 08:39 UTC] gamid at isayev dot net
> Exactly what distribution do you have?
Mandrake 8.1 with the following RPMs installed:
locales-2.3.1.2-4mdk
locales-en-2.3.1.2-4mdk
locales-ru-2.3.1.2-4mdk
locales-de-2.3.1.2-4mdk

> How are LANG/LANGUAGE/LC_ALL etc. environment variables
> set in your system (before starting Apache)..?
All of these environment variables are set to 'UTF-8'.

> And please, try with the stock php.ini-dist too.
Same result.
BTW, in the php.ini-dist there should be a semi-colon instead of colon in the line 98:
":       is doing." -> ";       is doing."

Could you tell me your settings so that I can try them out?
 [2002-09-09 10:45 UTC] sniper@php.net
I didn't know UTF-8 is a locale..

I tried setting LANG/LC_ALL to that and it indeed
didn't work. When I set those to "en_US" it works just fine.

 [2002-09-09 11:28 UTC] gamid at isayev dot net
> I tried setting LANG/LC_ALL to that and it indeed
> didn't work. When I set those to "en_US" it works just fine.
What do you mean by "works just fine"?
Did it convert 0xC39C ('Ü' in UTF-8 encoding) into 0xC3BC ('ü' in UTF-8 encoding)? Or 0xD0AF (Russian capital "ya" in UTF-8 encoding) into 0xD18F (Russian lowercase "ya" in UTF-8 encoding)?
 [2002-09-09 17:06 UTC] sniper@php.net
So you didn't try it..? I only tried your test script and
got the expected result. Whatever the characters are..I've no idea of them anyway..

btw. AFAIK, setting LANG / LC_ALL to UTF-8 is not the correct
way to do it..

http://melkor.dnp.fmph.uniba.sk/~garabik/debian-utf8/howto.html

According to that HOWTO, it should be something like ru_RU.UTF-8 (and only if you really have UTF-8 locales)

I'm bogusing this since it really isn't anything PHP can affect..

 [2002-09-10 08:44 UTC] gamid at isayev dot net
> So you didn't try it..?
Yes, I set LC_ALL/LANG to 'en_US' and tried it.

> I only tried your test script and got the expected result.
> Whatever the characters are.. I've no idea of them anyway..
I think you're confused by looking at the result of the test script with the encoding set to 'ISO-8859-x' instead of 'UTF-8'.
In that case it looks as if some characters changed to lower/upper case.
BUT they are not UTF-8 lower/upper case characters:
1) 0xC39C changed to 0xE39C, should be 0xC3BC
2) 0xD0AF changed to 0xF0AF, should be 0xD18F
As a result we get garbage instead of a UTF-8 string.
If you really want to test this issue, you should set 'default_charset=utf-8' in php.ini or set the encoding to 'UTF-8' in your browser.

> btw. AFAIK, setting LANG / LC_ALL to UTF-8 is not correct
> way to do it.. 
> According to that HOWTO, it should be something like
> ru_RU.UTF-8 (and only if you really have UTF-8 locales)
I tried en_US.UTF-8, de_DE.UTF-8, ru_RU.UTF-8 - no luck.

> I'm bogusing this since it really isn't anything PHP can
> affect..
So, is there no way in PHP to convert a UTF-8 string to lower/upper case?
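
For reference, the byte values quoted in this comment can be checked with a small C sketch that encodes a code point as UTF-8. The function name utf8_encode_bmp is hypothetical and is not part of PHP or libc; it only covers code points up to U+FFFF and does not reject surrogates.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point (U+0000..U+FFFF) as UTF-8.
 * out must have room for 3 bytes. Returns the number of bytes written,
 * or 0 if the code point is outside the range this sketch handles. */
size_t utf8_encode_bmp(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {                 /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {              /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    return 0;
}
```

With this, U+00DC encodes to C3 9C and U+00FC to C3 BC, while U+042F encodes to D0 AF and U+044F to D1 8F. A correct lower-casing of the Russian letter therefore has to change both bytes, which a byte-wise mapping cannot do.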
 [2002-09-10 08:52 UTC] wez@php.net
This is not a bug in PHP; it's down to whether your system
can support this and has the appropriate locales installed.

A quick and dirty example might look like this in C:

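#include <ctype.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
   char buff[1024];

   /* Upper-case stdin line by line, one byte at a time. */
   while (fgets(buff, sizeof(buff), stdin)) {
      size_t i, l;
      l = strlen(buff);
      for (i = 0; i < l; i++)
          buff[i] = toupper((unsigned char)buff[i]);
      fputs(buff, stdout);  /* fgets keeps the newline, so don't add one */
   }
   return 0;
}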

If that little program works, your system supports
this conversion.  If it doesn't, then PHP doesn't
either.

 [2002-09-10 08:54 UTC] wez@php.net
I forgot to add that you should feed your utf8 data to the
input of that little program.
 [2002-09-10 09:20 UTC] gamid at isayev dot net
As I understand it, toupper()/tolower() work only for single-byte encodings. So the right way is to use the 'wide' versions of toupper()/tolower(): towupper()/towlower().
Example:

#include <stdio.h>
#include <wctype.h>
#include <locale.h>

int main() {
    printf("locale set to '%s'\n", setlocale(LC_ALL, "UTF-8"));

    printf("0x00DC C='%C'\n", towlower(0x00DC));
    printf("0x042F C='%C'\n", towlower(0x042F));

    return 0;
}

And it works fine for UCS-2 (UTF-16).
In PHP I can convert UTF-8 to UTF-16 by using iconv().
But PHP has no 'wide' version of strtolower()/strtoupper().
So, what can I do?
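
The single-byte limitation described in this comment can be illustrated with a minimal C sketch. It assumes the program runs in the default "C" locale (no setlocale() call), and the helper name upper_bytes is made up for illustration.

```c
#include <ctype.h>
#include <stddef.h>

/* Upper-case a buffer one byte at a time, as a char-based loop does.
 * In the default "C" locale toupper() only maps 'a'..'z', so every byte
 * of a multi-byte UTF-8 sequence passes through unchanged. */
void upper_bytes(char *s, size_t len)
{
    for (size_t i = 0; i < len; i++)
        s[i] = (char)toupper((unsigned char)s[i]);
}
```

Given the six bytes of "Test" followed by C3 BC (ü), this yields "TEST" followed by the same C3 BC, which is consistent with the 'TESTü' output reported earlier in the thread. Under a single-byte locale the two bytes may instead be mapped independently, which is how the corrupted sequences arise.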
 [2002-09-25 20:23 UTC] wez@php.net
I've added a new function to the mbstring extension in CVS.
This function will be in PHP 4.3.

I would appreciate your feedback.
Try a snapshot from http://snaps.php.net/php4-latest.tar.gz
dated after this message.

usage:
proto string mb_convert_case(string str, int mode [, string encoding]);

mode can be one of MB_CASE_UPPER, MB_CASE_LOWER or MB_CASE_TITLE.
encoding specifies the encoding of str; if omitted, the
mbstring.internal_encoding value will be used.
The return value is str with the appropriate case folding applied.

The function works by internally converting the string into UCS-4 format
and applying php_unicode_to(upper|lower|title) to each unicode character,
and then converts the string back into the original encoding.

The code for your test case would look like this
(and works for me):

<?
$str = "Test".utf8_encode("\xFC");

$strU = mb_convert_case($str, MB_CASE_UPPER, "utf-8");
$strL = mb_convert_case($str, MB_CASE_LOWER, "utf-8");
?>
<PRE>
str  = '<? echo $str;  ?>'
strU = '<? echo $strU; ?>'
strL = '<? echo $strL; ?>'
</PRE>
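
The decode, map, re-encode approach described in this comment can be sketched in C roughly as follows. This is not the mbstring implementation: toy_upper and utf8_toy_upper are hypothetical names, the mapping table covers only ASCII, the Latin-1 letters and the basic Russian alphabet, and only 1- and 2-byte UTF-8 sequences are decoded.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, deliberately tiny upper-case mapping on code points. */
static uint32_t toy_upper(uint32_t cp)
{
    if (cp >= 'a' && cp <= 'z')
        return cp - 0x20;
    /* Latin-1 a-grave..thorn, skipping the division sign U+00F7 */
    if (cp >= 0x00E0 && cp <= 0x00FE && cp != 0x00F7)
        return cp - 0x20;
    /* Cyrillic small a..ya map to capital A..YA */
    if (cp >= 0x0430 && cp <= 0x044F)
        return cp - 0x20;
    return cp;
}

/* Upper-case a NUL-terminated UTF-8 string in place.
 * Each 1- or 2-byte sequence is decoded to a code point, mapped, and
 * re-encoded. Every mapping above keeps the encoded length, so the
 * rewrite can happen in place. Longer or malformed sequences are
 * left untouched. */
void utf8_toy_upper(unsigned char *s)
{
    size_t i = 0;
    while (s[i] != 0) {
        if (s[i] < 0x80) {
            s[i] = (unsigned char)toy_upper(s[i]);
            i += 1;
        } else if ((s[i] & 0xE0) == 0xC0 && (s[i + 1] & 0xC0) == 0x80) {
            uint32_t cp = ((uint32_t)(s[i] & 0x1F) << 6) | (s[i + 1] & 0x3F);
            cp = toy_upper(cp);
            s[i]     = (unsigned char)(0xC0 | (cp >> 6));
            s[i + 1] = (unsigned char)(0x80 | (cp & 0x3F));
            i += 2;
        } else {
            i += 1;  /* not handled by this sketch: skip the byte */
        }
    }
}
```

Applied to "Test" followed by C3 BC, this produces "TEST" followed by C3 9C, and D1 8F becomes D0 AF, which are the byte sequences the reporter expected. A real implementation needs full Unicode case tables and handling for mappings that change the encoded length.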
 [2002-10-07 16:36 UTC] gamid at isayev dot net
Works fine for German and Russian characters.
Thanks!
 