Bug #19257 strtolower & strtoupper does not work for UTF-8 strings
Submitted: 2002-09-05 15:46 UTC Modified: 2002-10-07 16:36 UTC
From: gamid at isayev dot net Assigned:
Status: Closed Package: Strings related
PHP Version: 4.2.2 OS: Linux
Private report: No CVE-ID: None
 [2002-09-05 15:46 UTC] gamid at isayev dot net
The functions strtolower() and strtoupper() do not change UTF-8 strings. I tried Russian (0x042F, 0x044F) and German (0x00DC, 0x00FC) characters.
Example:

<?
$str = "testЯ";

$loc = "UTF-8";
putenv("LANG=$loc");
$loc = setlocale(LC_ALL, $loc);

$strU = strtoupper($str);
$strL = strtolower($str);
?>
<PRE>
loc  = '<? echo $loc;  ?>'
str  = '<? echo $str;  ?>'
strU = '<? echo $strU; ?>'
strL = '<? echo $strL; ?>'
</PRE>

 [2002-09-05 16:02 UTC] sniper@php.net
Please try using this CVS snapshot:

  http://snaps.php.net/php4-latest.tar.gz
 
For Windows:
 
  http://snaps.php.net/win32/php4-win32-latest.zip

I can't reproduce this with PHP 4.3.0-dev.

 [2002-09-05 17:02 UTC] gamid at isayev dot net
I backported strtolower() and strtoupper() from the latest string.c (rev. 1.290). The functions still do not work:
loc  = 'UTF-8'
str  = 'Test?'
strU = 'TEST?'
strL = 'test?'


Since I am not sure that you entered the UTF-8 symbols correctly, here's a modified version of the code that creates the desired test string:

<?
$str = "Test".utf8_encode("\xFC");

$loc = "UTF-8";
putenv("LANG=$loc");
$loc = setlocale(LC_ALL, $loc);

$strU = strtoupper($str);
$strL = strtolower($str);
?>
<PRE>
loc  = '<? echo $loc;  ?>'
str  = '<? echo $str;  ?>'
strU = '<? echo $strU; ?>'
strL = '<? echo $strL; ?>'
</PRE>
 [2002-09-05 17:09 UTC] sniper@php.net
Output: 

str  = 'Testü'
strU = 'TESTü'
strL = 'test??'

Don't try to patch..you're not doing it right anyway.
Just pull the snapshot and try with it.

 [2002-09-05 17:24 UTC] gamid at isayev dot net
> Don't try to patch..you're not doing it right anyway.
:)

> Just pull the snapshot and try with it.
Ok

PHP: 20020307
PHP Extension: 20020429
Zend Extension: 20020903

Output:
loc  = 'UTF-8'
str  = 'Test?'
strU = 'TEST?'
strL = 'test?'
 [2002-09-06 13:27 UTC] gamid at isayev dot net
The functions do not work in PHP 4.2.3 and latest snapshot.
 [2002-09-06 13:29 UTC] gamid at isayev dot net
BTW, here is my configuration:

./configure \
  --with-apxs=/usr/sbin/apxs \
  --enable-track-vars \
  --enable-safe-mode \
  --with-config-file-path=/etc/httpd \
  --with-zlib \
  --enable-magic-quotes \
  --with-regex=system \
  --without-mysql \
  --without-xml \
  --without-gd \
  --with-pgsql=shared \
  --with-imap \
  --with-iconv \
  --enable-mbstring \
  --with-xml \
  --with-kerberos

php.ini has 'default_charset=utf-8'.
 [2002-09-07 14:20 UTC] sniper@php.net
Exactly what distribution do you have? 
How are LANG/LANGUAGE/LC_ALL etc. environment variables
set in your system (before starting Apache)..?

And please, try with the stock php.ini-dist too.


 [2002-09-09 08:39 UTC] gamid at isayev dot net
> Exactly what distribution do you have?
Mandrake 8.1 with the following RPMs installed:
locales-2.3.1.2-4mdk
locales-en-2.3.1.2-4mdk
locales-ru-2.3.1.2-4mdk
locales-de-2.3.1.2-4mdk

> How are LANG/LANGUAGE/LC_ALL etc. environment variables
> set in your system (before starting Apache)..?
All environment variables are set to 'UTF-8'.

> And please, try with the stock php.ini-dist too.
Same result.
BTW, in php.ini-dist there should be a semicolon instead of a colon on line 98:
":       is doing." -> ";       is doing."

Could you tell me your settings so that I can try them out?
 [2002-09-09 10:45 UTC] sniper@php.net
I didn't know UTF-8 is a locale..

I tried setting LANG/LC_ALL to that and it indeed
didn't work. When I set those to "en_US" it works just fine.

 [2002-09-09 11:28 UTC] gamid at isayev dot net
> I tried setting LANG/LC_ALL to that and it indeed
> didn't work. When I set those to "en_US" it works just fine.
What do you mean by "works just fine"?
Did it convert 0xC39C ('Ü' in UTF-8 encoding) into 0xC3BC ('ü' in UTF-8 encoding)? Or 0xD0AF (Russian capital "ya" in UTF-8 encoding) into 0xD18F (Russian lowercase "ya" in UTF-8 encoding)?
 [2002-09-09 17:06 UTC] sniper@php.net
So you didn't try it..? I only tried your test script and
got the expected result. Whatever the characters are..I've no idea of them anyway..

btw. AFAIK, setting LANG / LC_ALL to UTF-8 is not correct
way to do it.. 

http://melkor.dnp.fmph.uniba.sk/~garabik/debian-utf8/howto.html

According to that HOWTO, it should be something like ru_RU.UTF-8 (and only if you really have UTF-8 locales)

I'm bogusing this since it really isn't anything PHP can affect..

 [2002-09-10 08:44 UTC] gamid at isayev dot net
> So you didn't try it..?
Yes, I set LC_ALL/LANG to 'en_US' and tried it.

> I only tried your test script and got the expected result.
> Whatever the characters are.. I've no idea of them anyway..
I think you're confused because you looked at the result of the test script with the encoding set to 'ISO-8859-x' instead of 'UTF-8'.
In that case it looks as if some characters were changed to lower/upper case.
BUT they are not UTF-8 lower/upper case characters:
1) 0xC39C changed to 0xE39C, should be 0xC3BC
2) 0xD0AF changed to 0xF0AF, should be 0xD18F
As a result we get garbage instead of a UTF-8 string.
If you really want to test this issue, you should set 'default_charset=utf-8' in php.ini or set the encoding to 'UTF-8' in your browser.
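
The two byte values listed above can be reproduced with a small C sketch. It assumes that a single-byte ISO-8859-1 locale lowercases a byte by adding 0x20 to 'A'-'Z' and to 0xC0-0xDE (except 0xD7); the function names latin1_tolower() and bytewise_tolower() are made up for this illustration and are not PHP internals.

```c
#include <stddef.h>

/* Lowercase one byte the way an ISO-8859-1 (Latin-1) tolower() is assumed
   to behave: 'A'-'Z' and 0xC0-0xDE (except 0xD7, the multiplication sign)
   gain 0x20; every other byte is returned unchanged. */
unsigned char latin1_tolower(unsigned char c)
{
    if (c >= 'A' && c <= 'Z')
        return (unsigned char)(c + 0x20);
    if (c >= 0xC0 && c <= 0xDE && c != 0xD7)
        return (unsigned char)(c + 0x20);
    return c;
}

/* Apply the single-byte rule to every byte of buf, in place.
   This is the byte-by-byte approach, which knows nothing about UTF-8. */
void bytewise_tolower(unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++)
        buf[i] = latin1_tolower(buf[i]);
}
```

Fed the two bytes 0xC3 0x9C, the loop treats the lead byte 0xC3 as a Latin-1 capital letter and turns it into 0xE3, while the continuation byte stays as it is. The result 0xE3 0x9C is no longer a complete UTF-8 sequence. The same happens to 0xD0 0xAF, which becomes 0xF0 0xAF.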

> btw. AFAIK, setting LANG / LC_ALL to UTF-8 is not correct
> way to do it.. 
> According to that HOWTO, it should be something like
> ru_RU.UTF-8 (and only if you really have UTF-8 locales)
I tried en_US.UTF-8, de_DE.UTF-8, ru_RU.UTF-8 - no luck.

> I'm bogusing this since it really isn't anything PHP can
> affect..
So, is there no way in PHP to convert a UTF-8 string to lower/upper case?
 [2002-09-10 08:52 UTC] wez@php.net
This is not a bug in PHP; it's down to whether your system
can support this and has the appropriate locales installed.

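A quick and dirty example might look like this in C:

#include <ctype.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
   char buff[1024];

   /* pick up the locale from the environment (LANG / LC_ALL) */
   setlocale(LC_ALL, "");

   while (fgets(buff, sizeof(buff), stdin)) {
      size_t i, l = strlen(buff);
      /* toupper() works on one byte at a time and needs an unsigned char value */
      for (i = 0; i < l; i++)
          buff[i] = (char)toupper((unsigned char)buff[i]);
      fputs(buff, stdout);   /* fgets() kept the newline */
   }
   return 0;
}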

If that little program works, your system supports
this conversion.  If it doesn't, then PHP doesn't
either.

 [2002-09-10 08:54 UTC] wez@php.net
I forgot to add that you should feed your utf8 data to the
input of that little program.
 [2002-09-10 09:20 UTC] gamid at isayev dot net
As I understand it, toupper()/tolower() work only for single-byte encodings. So the right way is to use the 'wide' versions of toupper()/tolower(): towupper()/towlower().
Example:

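#include <stdio.h>
#include <wctype.h>
#include <locale.h>

int main(void) {
    const char *loc = setlocale(LC_ALL, "UTF-8");

    /* setlocale() returns NULL if the locale is not available */
    printf("locale set to '%s'\n", loc ? loc : "(not available)");

    /* towlower() returns a wint_t, which is what %lc expects */
    printf("0x00DC C='%lc'\n", towlower(0x00DC));
    printf("0x042F C='%lc'\n", towlower(0x042F));

    return 0;
}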

And it works fine for UCS-2 (UTF-16).
In PHP I can convert UTF-8 to UTF-16 by using iconv().
But PHP has no 'wide' version of strtolower()/strtoupper().
So, what can I do?
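
For reference, the wide-character route described in this comment can be sketched in C roughly as follows. This is only an illustration under stated assumptions: wide_toupper() is a hypothetical helper name, and the caller is assumed to have already selected a locale whose LC_CTYPE matches the encoding of the input (for example ru_RU.UTF-8, if that locale is installed). In the default "C" locale only ASCII letters are reliably converted.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Uppercase the multibyte string `in` via wide characters and store the
   result in `out`, which has room for `cap` bytes including the terminator.
   Returns 0 on success, -1 if the input is not valid in the current locale,
   if memory runs out, or if the result does not fit. */
int wide_toupper(const char *in, char *out, size_t cap)
{
    /* every wide character consumes at least one input byte, so
       strlen(in) + 1 wide characters are always enough */
    size_t max = strlen(in) + 1;
    wchar_t *w = malloc(max * sizeof *w);
    size_t n, i, written;

    if (w == NULL)
        return -1;

    n = mbstowcs(w, in, max);            /* multibyte -> wide */
    if (n == (size_t)-1) {
        free(w);
        return -1;
    }

    for (i = 0; i < n; i++)              /* case-map whole characters */
        w[i] = (wchar_t)towupper((wint_t)w[i]);

    written = wcstombs(out, w, cap);     /* wide -> multibyte */
    free(w);

    /* written == cap would mean there was no room for the terminator */
    if (written == (size_t)-1 || written >= cap)
        return -1;
    return 0;
}
```

A caller would first run something like setlocale(LC_CTYPE, "ru_RU.UTF-8") and check that it did not return NULL, then pass the UTF-8 string to wide_toupper(). The result still depends on the locales installed on the system, which is the limitation discussed earlier in this report.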
 [2002-09-25 20:23 UTC] wez@php.net
I've added a new function to the mbstring extension in CVS.
This function will be in PHP 4.3.

I would appreciate your feedback.
Try a snapshot from http://snaps.php.net/php4-latest.tar.gz
dated after this message.

usage:
proto string mb_convert_case(string str, int mode [, string encoding]);

mode can be one of MB_CASE_UPPER, MB_CASE_LOWER or MB_CASE_TITLE.
encoding specifies the encoding of str; if omitted, the
mbstring.internal_encoding value will be used.
The return value is str with the appropriate case folding applied.

The function works by internally converting the string into UCS-4 format
and applying php_unicode_to(upper|lower|title) to each unicode character,
and then converts the string back into the original encoding.

The code for your test case would look like this
(and works for me):

<?
$str = "Test".utf8_encode("\xFC");

$strU = mb_convert_case($str, MB_CASE_UPPER, "utf-8");
$strL = mb_convert_case($str, MB_CASE_LOWER, "utf-8");
?>
<PRE>
str  = '<? echo $str;  ?>'
strU = '<? echo $strU; ?>'
strL = '<? echo $strL; ?>'
</PRE>
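
The decode, map, re-encode approach described in this comment can be illustrated with a small standalone C sketch. It is not the mbstring implementation: it decodes UTF-8 directly instead of going through a general UCS-4 conversion, and toy_toupper() is a hypothetical stand-in for a real Unicode case table that covers only ASCII, Latin-1 and basic Cyrillic. All function names here are invented for the example, and the decoder does not reject overlong sequences.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy case table (assumption: stands in for a full Unicode mapping). */
uint32_t toy_toupper(uint32_t cp)
{
    if (cp >= 'a' && cp <= 'z') return cp - 0x20;                 /* ASCII */
    if (cp >= 0xE0 && cp <= 0xFE && cp != 0xF7) return cp - 0x20; /* Latin-1 */
    if (cp >= 0x430 && cp <= 0x44F) return cp - 0x20;             /* Cyrillic */
    if (cp >= 0x450 && cp <= 0x45F) return cp - 0x50;
    return cp;
}

/* Decode one UTF-8 sequence at s (n bytes remain) into *cp.
   Returns the number of bytes consumed, or 0 on malformed input. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len, i;

    if (n == 0) return 0;
    if (s[0] < 0x80)                { *cp = s[0];        len = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; len = 4; }
    else return 0;
    if (len > n) return 0;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* not a continuation byte */
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return len;
}

/* Encode cp as UTF-8 into out (at least 4 bytes); returns bytes written. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Uppercase len bytes of UTF-8 from in into out (capacity cap).
   Returns the number of bytes written, or -1 on malformed input or
   insufficient space. The output is not NUL-terminated. */
long utf8_toupper(const unsigned char *in, size_t len,
                  unsigned char *out, size_t cap)
{
    size_t i = 0, o = 0;

    while (i < len) {
        uint32_t cp;
        unsigned char tmp[4];
        size_t w, k;
        size_t used = utf8_decode(in + i, len - i, &cp);

        if (used == 0) return -1;
        w = utf8_encode(toy_toupper(cp), tmp);
        if (o + w > cap) return -1;
        for (k = 0; k < w; k++)
            out[o++] = tmp[k];
        i += used;
    }
    return (long)o;
}
```

Because the mapping is applied to whole code points rather than to individual bytes, the two bytes of 'ü' (0xC3 0xBC) come out as 0xC3 0x9C, and Russian lowercase "ya" (0xD1 0x8F) comes out as 0xD0 0xAF. Those are the values the original report expected.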
 [2002-10-07 16:36 UTC] gamid at isayev dot net
Works fine for German and Russian characters.
Thanks!
 