Bug #42290 mb_eregi_replace() is not case-insensitive with multibyte pattern
Submitted: 2007-08-14 01:28 UTC Modified: 2011-10-15 09:07 UTC
Votes:9
Avg. Score:4.7 ± 0.7
Reproduced:7 of 7 (100.0%)
Same Version:2 (28.6%)
Same OS:3 (42.9%)
From: arysin at gmail dot com Assigned: hirokawa (profile)
Status: Closed Package: mbstring related
PHP Version: 5.2CVS-2007-08-14 OS: *
Private report: No CVE-ID: None
 [2007-08-14 01:28 UTC] arysin at gmail dot com
Description:
------------
The function mb_eregi_replace(), and likewise mb_ereg_replace() with the
'i' option, is not case-insensitive for multibyte characters.
The same problem occurs for preg_replace() with the /i option.

This bug has been reported twice before:
1) #39999 was marked as Bogus, on the grounds that it is not a PHP bug.
2) #25953 was marked as Won't fix with the note "Probably the issue will
be resolved in php5."

Since nothing can be added to a closed bug, I am reopening the issue here, for the following reason:

The Oniguruma library is not linked dynamically; on the contrary, its source is bundled in the PHP source repository. So PHP users have no way to get this problem fixed unless PHP itself is fixed.
Moreover, the version of Oniguruma in the PHP repository is quite old, so importing a newer version would at least be a reasonable first attempt at fixing this bug.


Reproduce code:
---------------
<?php

mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');

$pattern = '?'; //s 'crown'
$replace = 'X';

$subject = '?iltas, ?iltas';

$result = mb_eregi_replace($pattern, $replace, $subject);

echo $result;

$result = preg_replace("/$pattern/iu", $replace, $subject);

echo $result . "\n";
?>


Expected result:
----------------
Xiltas, Xiltas
Xiltas, Xiltas


Actual result:
--------------
?iltas, Xiltas
?iltas, Xiltas
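
The non-ASCII letters in the reproduce code above appear as '?' on this page. The following is a minimal sketch of the same test, assuming the lost letters were U+0160 and U+0161 (S with caron), which is what the later 0x8a/0x9a discussion suggests. Writing them as escapes keeps the file encoding out of the picture. The \u{...} syntax requires PHP 7.0 or later, and the function name replace_both_ways is hypothetical.

```php
<?php
// Assumption: the '?' placeholders stood for U+0160 / U+0161 (S with caron).
mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');

$upper   = "\u{0160}";
$lower   = "\u{0161}";
$subject = "{$upper}iltas, {$lower}iltas";

// Run the same case-insensitive replacement through both regex engines.
function replace_both_ways(string $pattern, string $subject): array
{
    return [
        'mbstring' => mb_eregi_replace($pattern, 'X', $subject),
        'pcre'     => preg_replace('/' . preg_quote($pattern, '/') . '/iu', 'X', $subject),
    ];
}

foreach (replace_both_ways($lower, $subject) as $engine => $result) {
    echo $engine, ': ', $result, "\n";
}
```

On a build where case-insensitive matching works for these letters, both lines should read "Xiltas, Xiltas"; on an affected build the first word keeps its capital letter.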


 [2007-08-14 09:11 UTC] jani@php.net
I get this as output:

?iltas, Xiltas
Xiltas, Xiltas

So I don't think there's anything wrong with PCRE, just mbstring stuff.
 [2007-08-17 13:46 UTC] jani@php.net
Assigned to the maintainer of mbstring extension.
 [2007-08-19 02:27 UTC] hirokawa@php.net
I got the same result as arysin,
(PHP_5_2CVS 20070819, PCRE 7.2 2007-06-19)

?iltas, Xiltas
?iltas, Xiltas

I think PCRE is also not working.

Jani, which version of PHP/PCRE are you using?

 [2007-08-19 20:05 UTC] jani@php.net
I'm using the bundled PCRE library. I don't remember what the version is.
 [2007-08-21 15:46 UTC] hirokawa@php.net
arysin,

Which encoding are you using?

In ISO-8859-1 and in Unicode (U+008A), 0x8a maps to a C1 control character (Line Tabulation Set), not a letter.

  c.f.: http://en.wikipedia.org/wiki/ISO_8859-1
       http://en.wikipedia.org/wiki/UTF-8

As I understand it, 0x8a should not be treated as the
uppercase form of 0x9a in ISO-8859-1 or UTF-8.

If you are using CP1252 (Windows-1252), the expectation is understandable,
but CP1252 is not yet supported by the Oniguruma library
(the multibyte regex engine of mbstring).
http://en.wikipedia.org/wiki/Windows-1252
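
The encoding point can be checked directly. This sketch decodes the single bytes 0x8a and 0x9a under two encodings and prints the resulting code points. It assumes PHP 7.2 or later (for mb_ord()) and an mbstring build that knows Windows-1252; the helper name describe_byte is made up for illustration.

```php
<?php
// Decode one byte under a single-byte encoding and return its code point as U+XXXX.
function describe_byte(string $byte, string $encoding): string
{
    $utf8 = mb_convert_encoding($byte, 'UTF-8', $encoding);
    return sprintf('U+%04X', mb_ord($utf8, 'UTF-8'));
}

foreach (['Windows-1252', 'ISO-8859-1'] as $encoding) {
    printf(
        "%-12s 0x8a -> %s, 0x9a -> %s\n",
        $encoding,
        describe_byte("\x8a", $encoding),
        describe_byte("\x9a", $encoding)
    );
}
```

Under Windows-1252 the two bytes decode to U+0160 and U+0161, an upper/lower case pair. Under ISO-8859-1 they decode to U+008A and U+009A, control characters that have no case.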


 [2007-09-12 01:00 UTC] php-bugs at lists dot php dot net
No feedback was provided for this bug for over a week, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
 [2008-05-03 07:38 UTC] admin at bg-history dot info
I have the same problem with UTF-8 encoding, using Cyrillic.

While trying to implement search highlighting, neither "eregi_replace" nor "str_ireplace" matched the capital letter.

For example:

$str="общи";

$newstr="Общи";

$bodytext = str_ireplace($str, "<span style=\"color: #FF0000\">".$str."</span>", $bodytext);

$bodytext2 = str_ireplace($newstr, "<span style=\"color: #FF0000\">".$newstr."</span>", $bodytext);

$bodytext contains the word "Общи". Although I used a case-insensitive replace, the word is highlighted only in $bodytext2.

I've searched a lot for a solution to this problem and found none.

P.S. Sorry for my English, hope it's understandable.
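
str_ireplace() compares byte by byte and never folds case across multibyte UTF-8 sequences, so it cannot match a capital Cyrillic letter against its lowercase form. One possible workaround that avoids the regex engines entirely is to locate matches with mb_stripos() and splice in the markup by hand. This is a sketch under stated assumptions, not a drop-in fix: mb_highlight is a hypothetical helper, it assumes valid UTF-8 input, and it assumes case folding does not change the character count of the match (true for Cyrillic, not for every script).

```php
<?php
// Wrap every case-insensitive occurrence of $needle in $text with $open/$close,
// keeping the original casing of the matched text.
function mb_highlight(string $needle, string $text, string $open, string $close): string
{
    if ($needle === '') {
        return $text;
    }
    $out    = '';
    $offset = 0;
    $len    = mb_strlen($needle, 'UTF-8');
    while (($pos = mb_stripos($text, $needle, $offset, 'UTF-8')) !== false) {
        // Copy the text before the match, then the match itself inside the markup.
        $out   .= mb_substr($text, $offset, $pos - $offset, 'UTF-8');
        $out   .= $open . mb_substr($text, $pos, $len, 'UTF-8') . $close;
        $offset = $pos + $len;
    }
    // Append whatever follows the last match.
    return $out . mb_substr($text, $offset, null, 'UTF-8');
}

echo mb_highlight('общи', 'Общи условия', '<span style="color: #FF0000">', '</span>'), "\n";
```

Unlike the str_ireplace() calls above, this inserts the text as it appears in the document rather than the search term, so the capitalization of the page is preserved.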
 [2009-04-15 16:04 UTC] rvorojbit at gmail dot com
I am also having the exact same problem now as was described in the previous post last year!!! Is there any workaround for this bug? I didn't find any in google...
 [2009-09-30 13:12 UTC] babson at gmail dot com
I am using PHP version 5.2.9 and have the same problem.
I tried sample by arysin and got the same result as he did.

What can be done?
 [2010-08-27 16:22 UTC] bubalula at gmail dot com
I have the same problem in version 5.2.12.
I don't know why this bug isn't taken seriously, as it creates big problems for those of us working with non-Latin languages.
 [2010-08-27 16:36 UTC] bubalula at gmail dot com
I tried also on another server with php version 5.2.11 and it does not work either.
 [2010-08-28 03:20 UTC] hirokawa@php.net
-Status: No Feedback +Status: Assigned
 [2010-08-28 03:20 UTC] hirokawa@php.net
Could you provide the following details:

- code snippet which can reproduce the problem.
- setting information of mbstring.* in php.ini
- character encoding which you are using.
- version/locale of your OS.
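
Most of the items requested above can be gathered with one short script. This is only a sketch: the array keys are arbitrary labels, and it assumes the mbstring extension is loaded with mbregex enabled.

```php
<?php
// Collect version, OS, locale, and mbstring settings in one dump.
$report = [
    'php_version'       => PHP_VERSION,
    'os'                => PHP_OS,
    'locale'            => setlocale(LC_ALL, '0'),   // '0' queries without changing the locale
    'internal_encoding' => mb_internal_encoding(),
    'regex_encoding'    => mb_regex_encoding(),
    'mbstring_ini'      => ini_get_all('mbstring', false),
];
print_r($report);
```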
 [2010-10-13 01:29 UTC] gevorg dot ha at gmail dot com
Hi, 

Please find below a code snippet which shows that it doesn't work:

mb_internal_encoding("UTF-8");
mb_regex_encoding("UTF-8");

// The text contains the same word three times, differing only in letter case.
$hText = 'ՀԱՅԱՍՏԱՆԸ Հայաստան հայաստան';

// Neither of these works: only the last word is replaced.
echo mb_eregi_replace ('հայաստան', '<strong>\\0</strong>', $hText).'<br/>'; 
echo mb_ereg_replace ('հայաստան', '<strong>\\0</strong>', $hText, 'msri').'<br/>'; 

Best,
Gevorg
 [2011-10-15 08:55 UTC] hirokawa@php.net
Automatic comment from SVN on behalf of hirokawa
Revision: http://svn.php.net/viewvc/?view=revision&revision=318132
Log: updated bundled oniguruma regex library to 5.9.2. fixed bug #42290.
 [2011-10-15 09:07 UTC] hirokawa@php.net
-Status: Assigned +Status: Closed
 [2011-10-15 09:07 UTC] hirokawa@php.net
This bug has been fixed in SVN.

Snapshots of the sources are packaged every three hours; this change
will be in the next snapshot. You can grab the snapshot at
http://snaps.php.net/.

 For Windows:

http://windows.php.net/snapshots/
 
Thank you for the report, and for helping us make PHP better.

Prior to PHP 5.4.0, the bundled multibyte regex library (Oniguruma 4.7.2) did not support case-insensitive matching for Unicode characters outside the Latin-1 range.
The Oniguruma library has been updated to the newest version (5.9.2), which fully supports Unicode properties.
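
As a rough check of whether an installed build has the fix, the Armenian sample from the earlier comment can be counted rather than eyeballed. This is a sketch, assuming PHP 7.0 or later for the type declarations and an mbstring build with mbregex enabled; count_ci_matches is a hypothetical helper, and the \x01 marker is assumed not to occur in the subject.

```php
<?php
mb_regex_encoding('UTF-8');

// Count case-insensitive matches by replacing each one with a marker byte
// and counting the markers.
function count_ci_matches(string $pattern, string $subject): int
{
    $marked = mb_eregi_replace($pattern, "\x01", $subject);
    return substr_count((string) $marked, "\x01");
}

$text = 'ՀԱՅԱՍՏԱՆԸ Հայաստան հայաստան';
printf("PHP %s matched %d of 3 words\n", PHP_VERSION, count_ci_matches('հայաստան', $text));
```

A build with the updated library should report 3 of 3; an affected build reports only the all-lowercase word.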
 [2012-04-18 09:48 UTC] laruence@php.net
Automatic comment on behalf of hirokawa
Revision: http://git.php.net/?p=php-src.git;a=commit;h=fe92d64a4ad700082b1e805f381183884fb7dbe1
Log: updated bundled oniguruma regex library to 5.9.2. fixed bug #42290.
 [2012-07-24 23:39 UTC] rasmus@php.net
Automatic comment on behalf of hirokawa
Revision: http://git.php.net/?p=php-src.git;a=commit;h=fe92d64a4ad700082b1e805f381183884fb7dbe1
Log: updated bundled oniguruma regex library to 5.9.2. fixed bug #42290.
 [2013-11-17 09:35 UTC] laruence@php.net
Automatic comment on behalf of hirokawa
Revision: http://git.php.net/?p=php-src.git;a=commit;h=fe92d64a4ad700082b1e805f381183884fb7dbe1
Log: updated bundled oniguruma regex library to 5.9.2. fixed bug #42290.
 
PHP Copyright © 2001-2024 The PHP Group
All rights reserved.
Last updated: Sat Apr 20 16:01:29 2024 UTC