Bug #25669 eregi() vs. 8-bit chars in regex
Submitted: 2003-09-26 08:20 UTC Modified: 2003-10-01 05:21 UTC
From: svs at ropnet dot ru Assigned:
Status: Closed Package: Regexps related
PHP Version: 4.3.3 OS: FreeBSD 4.8
Private report: No CVE-ID: None
 [2003-09-26 08:20 UTC] svs at ropnet dot ru
Description:
------------
Even though the locale is set up correctly, eregi() fails to match international characters case-insensitively. The reason, as far as I understand, is that the code in regex/ passes a negative value to isalpha(). This can be worked around by recompiling regex/regcomp.c manually with -funsigned-char (assuming GCC is the compiler).
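
A minimal sketch of the suspected cause, assuming 8-bit bytes and a compiler where plain char is signed by default. The helper names are hypothetical and are not taken from regex/regcomp.c. It shows what value reaches a <ctype.h> function when a byte above 0x7F (0xD1, for example) is held in a plain char and passed through without a cast:

```c
#include <assert.h>
#include <limits.h>

/* Nonzero when plain char is signed on this compiler/target.
 * Building with gcc -funsigned-char makes this return 0. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Hypothetical helper: the int value that isalpha() would receive if the
 * code stored `byte` in a plain char and passed it on with no cast. */
int value_seen_by_ctype(unsigned char byte)
{
    assert(CHAR_BIT == 8);  /* this sketch assumes 8-bit bytes */
    char c = (char)byte;    /* the byte as held in a plain char */
    return c;               /* implicit char -> int promotion */
}
```

With signed char the result for 0xD1 is negative, which is outside what isalpha() is defined to accept. With -funsigned-char it stays 0xD1, which would explain why the workaround helps.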


Reproduce code:
---------------
<?php
setlocale(LC_ALL, "ru_RU.KOI8-R"); 
echo setlocale(LC_ALL, ""), "\n";
if (eregi("&#1103;", "&#1071;&#1071;")) { echo "ok\n"; } else { echo "bad\n";}
if (preg_match("/&#1103;/i", "&#1071;&#1071;")) { echo "ok\n"; } else { echo "bad\n";}
?>


Expected result:
----------------
ru_RU.KOI8-R
ok
ok


Actual result:
--------------
ru_RU.KOI8-R
bad
ok


 [2003-09-26 09:13 UTC] sniper@php.net
I don't think you meant to use those chars in your example
script..? Can you please add the actual ones here?

 [2003-09-26 09:16 UTC] sniper@php.net
And what was the configure line used to configure PHP?

 [2003-09-26 09:34 UTC] svs at ropnet dot ru
oops, mozilla mangled those characters.

begin 644 l.php
M/#]P:'`*<V5T;&]C86QE*$Q#7T%,3"P@(G)U7U)5+DM/23@M4B(I.R`*96-H
M;R!S971L;V-A;&4H3$-?04Q,+"`B(BDL(")<;B(["FEF("AE<F5G:2@BT2(L
M("+Q\2(I*2![(&5C:&\@(F]K7&XB.R!](&5L<V4@>R!E8VAO(")B861<;B([
M?0II9B`H<')E9U]M871C:"@B+]$O:2(L("+Q\2(I*2![(&5C:&\@(F]K7&XB
=.R!](&5L<V4@>R!E8VAO(")B861<;B([?0H_/@H`
`
end

'./configure' '--without-x' '--disable-debug' '--with-apxs=/usr/local/apache/bin/apxs' '--with-mod_charset' '--enable-dba' '--with-gdbm=/usr/local' '--with-db4=/usr/local' '--enable-dbase' '--enable-ftp' '--enable-sockets' '--enable-inline-optimization' '--enable-memory-limit' '--with-mysql' '--with-gd' '--enable-gd-native-ttf' '--with-zlib=/usr' '--with-jpeg-dir=/usr/local' '--with-png-dir=/usr/local' '--with-freetype-dir=/usr/local' '--enable-exif' '--enable-calendar' '--enable-wddx' '--with-gmp' '--with-openssl=/usr' '--with-iconv=/usr/local' '--with-imap=shared,/usr/local' '--with-curl=/usr/local' '--with-dom=shared,/usr/local' '--with-dom-xslt=shared,/usr/local' '--with-dom-exslt=shared,/usr/local' '--enable-xslt=shared' '--with-xslt-sablot=shared,/usr/local' '--with-iconv-dir=/usr/local' '--with-expat-dir=/usr/local' '--with-zip=/usr/local' '--with-pdflib' '--with-tiff-dir=/usr/local'
 [2003-09-28 20:14 UTC] iliaa@php.net
Try this patch and see if it fixes the problem.
http://bb.prohost.org/reg.txt
 [2003-09-29 06:23 UTC] svs at ropnet dot ru
No, it does not.
 [2003-09-29 08:08 UTC] moriyoshi@php.net
Ilia: your patch doesn't seem to deal with it correctly, as isalpha() does indeed expect a signed integer. A char value can be any number from -128 to 127, so if you cast a negative one directly to unsigned integer, you never get a value in the range 0 to 255. You should first cast it to unsigned char and then convert that to a signed integer.
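
The difference between the two conversions can be sketched as follows. This is an illustration only, not the actual patch, and every function name here is hypothetical. It relies on the C rule that the argument to isalpha() must be representable as unsigned char or equal to EOF:

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>

/* Direct cast: a negative char turns into a huge unsigned value
 * (UINT_MAX + 1 + c), not a byte value in 0..255. */
unsigned int cast_direct(char c)
{
    return (unsigned int)c;
}

/* Two-step conversion: unsigned char first, then int.
 * The result is always in 0..UCHAR_MAX. */
int cast_via_unsigned_char(char c)
{
    return (int)(unsigned char)c;
}

/* Hypothetical wrapper: isalpha() that is safe to call with any plain char. */
int isalpha_char(char c)
{
    int arg = cast_via_unsigned_char(c);
    assert(arg >= 0 && arg <= UCHAR_MAX);  /* always within the ctype domain */
    return isalpha(arg);
}
```

Whether isalpha_char() reports a given 8-bit byte as alphabetic still depends on the locale in effect. The wrapper only guarantees that the argument is in range.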
 [2003-09-29 11:17 UTC] moriyoshi@php.net
Can you try this one again:
http://www.voltex.jp/patches/regpatch.diff

Note: This problem is known not to reproduce with glibc, and unfortunately I don't have a FreeBSD box at the moment.

 [2003-09-29 12:03 UTC] svs at ropnet dot ru
This one is OK.  Should I write a test case?
 [2003-09-29 12:11 UTC] moriyoshi@php.net
Yup, if possible. The following is just a template (supposed to be put in ext/standard/tests/reg). Please try to avoid using non-ascii characters in the test case. Thanks in advance.

--TEST--
Bug #25669 (eregi() with non-ascii characters)
--SKIPIF--
<?php
setlocale(LC_ALL, "de_DE.ISO8859-1") || die('SKIP de_DE.ISO8859-1 locale not supported by this system');
?>
--FILE--
<?php
setlocale(LC_ALL, "de_DE.ISO8859-1");
var_dump((bool)eregi("\xc4\xcb\xf6", "\xe4\xeb\xd6"));
var_dump((bool)eregi("\xc4", "\xe1"));
var_dump((bool)preg_match("/\xc4\xcb\xf6/i", "\xe4\xeb\xd6"));
var_dump((bool)preg_match("/\xc4/i", "\xe1"));
?>
--EXPECT--
bool(true)
bool(false)
bool(true)
bool(false)

 [2003-10-01 04:41 UTC] svs at ropnet dot ru
This template is a complete test case actually. I did not have to modify it in any way.
 [2003-10-01 05:21 UTC] moriyoshi@php.net
Ok, I'm closing the bug.
Thanks for helping make php better.

(The fix will go to 4.3.4-rc2 or later versions)
