PHP :: Bug #47480 :: preg_replace with "/i" is not case insensitive


Submitted: 2009-02-23 13:32 UTC
Modified: 2009-03-12 18:46 UTC
Votes: 4
Avg. Score: 3.5 ± 1.7
Reproduced: 3 of 4 (75.0%)
Same Version: 2 (66.7%)
Same OS: 3 (100.0%)
From: sehh at ionos dot gr
Assigned:
Status: Not a bug
Package: PCRE related
PHP Version: 5.2.8
OS: Linux
Private report:
CVE-ID: None

[2009-02-23 13:32 UTC] sehh at ionos dot gr

Description:
------------
preg_replace with the "/i" (case insensitive search) does not do a case insensitive search for UTF-8 Greek characters, while it works fine for English characters.


Reproduce code:
---------------
<?php
$string = "?? ?????? ????? ??? ????????, ???? ??? ???????????? ???? ??????????"; // UTF-8 string in Greek language
$target1 = "????????"; // Target string to search for (capitalized)
$target2 = "????????"; // Target string to search for (small letters)
$replace = "itworks"; // Replace with this string

$rc = preg_replace("/$target1/imsUu", $replace, $string, -1, $counter); // Execute search for target1 and replace

echo "\nSearching for: ".$target1."\n"; // Report output
echo "Result string: ".$rc."\n";
echo "Found and replaced: ".$counter."\n";

$rc = preg_replace("/$target2/imsUu", $replace, $string, -1, $counter); // Execute search for target2 and replace

echo "\nSearching for: ".$target2."\n"; // Report output
echo "Result string: ".$rc."\n";
echo "Found and replaced: ".$counter."\n\n";
?>

Expected result:
----------------
I expect "Found and replaced" to be 1 in both cases, since the expression is case-insensitive.

Actual result:
--------------
$ php -f test.php 

Searching for: ????????
Result string: ?? ?????? ????? ??? ????????, ???? ??? ???????????? ???? ??????????
Found and replaced: 0

Searching for: ????????
Result string: ?? ?????? ????? ??? itworks, ???? ??? ???????????? ???? ??????????
Found and replaced: 1


[2009-03-09 11:59 UTC] mmcnickle at gmail dot com

The test case is wrong and the bug should be closed. The upper case search target is misspelled.

$target1 = "????????";
$target2 = "????????";
should read
$target1 = "????????";
$target2 = "????????";

(note the replacement of the second ? with a capital Thorn (U+00DE)).

With this change I get the expected result:

Actual Result
-------------

Searching for: ????????
Result string: ?? ?????? ????? ??? itworks, ???? ??? ???????????? ???? ??????????
Found and replaced: 1

Searching for: ????????
Result string: ?? ?????? ????? ??? itworks, ???? ??? ???????????? ???? ??????????
Found and replaced: 1

[2009-03-09 12:16 UTC] sehh at ionos dot gr

Obviously you have no idea what you are talking about and obviously you don't speak Greek or know anything about the Greek language.

The word "????????" is capitalized as "????????".

What you are suggesting is like capitalizing the word "engine" as "ENGiNE".

Obviously, there is no word "ENGiNE", same way there is no word "????????" :)

[2009-03-09 14:31 UTC] mmcnickle at gmail dot com

You're absolutely correct, I do not speak Greek. But neither does the PCRE library. It determines the uppercase/lowercase relationship between characters solely using Unicode properties.

The lowercase of ? is defined in Unicode as ? [1], not ?. Therefore the case-insensitive search will not match.

[1]http://www.fileformat.info/info/unicode/char/00c7/index.htm
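
The behaviour described in this comment can be reproduced with a single letter. The following is a minimal sketch, not code from the report: it assumes PHP 7 or later (for the `\u{...}` string escapes) and uses the Greek letter alpha as an illustrative stand-in, since the original Greek text of the test case is not recoverable here. With the `/iu` modifiers, PCRE pairs a character only with its Unicode case counterpart, so an accented lowercase letter matches the accented capital but not the unaccented capital.

```php
<?php
// Code points are written as escapes so the example does not depend on
// how this page or the source file is encoded.
$lowerAlphaTonos = "\u{03AC}"; // GREEK SMALL LETTER ALPHA WITH TONOS
$upperAlphaTonos = "\u{0386}"; // GREEK CAPITAL LETTER ALPHA WITH TONOS
$upperAlpha      = "\u{0391}"; // GREEK CAPITAL LETTER ALPHA (no accent)

// Case-insensitive, UTF-8 aware pattern containing only the accented letter.
$pattern = '/' . $lowerAlphaTonos . '/iu';

// The accented capital is the Unicode uppercase of the pattern letter.
$matchesAccentedCapital = preg_match($pattern, $upperAlphaTonos);

// The unaccented capital is a different letter as far as Unicode case
// mapping is concerned, so /i does not bridge the gap.
$matchesUnaccentedCapital = preg_match($pattern, $upperAlpha);

var_dump($matchesAccentedCapital);   // int(1)
var_dump($matchesUnaccentedCapital); // int(0)
```

This is consistent with the "Found and replaced: 0" result in the report if the capitalized target was written without the accent that the lowercase word carries.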

[2009-03-09 14:54 UTC] sehh at ionos dot gr

The PCRE library is wrong then.

"?" is correctly defined in Unicode as "?", but the library should also understand the meaning of "?" == "?" == "?".

This applies to all Greek accents:

"?" == "?" == "?"
etc...

Otherwise, the parameter "/i" is useless for the Greek language, and that's why the current implementation does not work for Greek.

Thank you for taking the time to look into this issue, much appreciated.

[2009-03-09 15:00 UTC] sehh at ionos dot gr

I forgot the capital accented characters, so the above should read:

"?" == "?" == "?" == "?"
"?" == "?" == "?" == "?"
etc..

Remember that in Greek, the accent may be omitted from capital letters or may be included for the first letter only. So that should produce proper case-insensitive results.

[2009-03-09 15:25 UTC] mmcnickle at gmail dot com

Yes, unfortunately trying to include locale and language specific cases is next to impossible for regular expression engine developers. 

The best that can be done, though far from ideal, is for the user to try to take these changes into account when they are crafting the regex:

$target1 = "?????[??]??"; // Greek

$target1 = "Stra(?:ss|ß)enbahn"; // German

[2009-03-09 16:01 UTC] sehh at ionos dot gr

Indeed, that's far from ideal. From my development point of view, it's impossible to rewrite every single accented character with its possible equivalents for the entire string, for every string in the regex.

For example, this:
/???????? ?????????-????????/i

Would become a monster like this:
/????[???]?[???]? ???????[???]?-??????[???]?/i

We would need a regex to create the regex! or at least a text search/replace method in PHP.

Are you sure it's impossible to add a few exceptions within the PCRE library?
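
The "regex to create the regex" idea can be done in userland. The following is a rough sketch under stated assumptions, not a PHP or PCRE feature: `accentTolerantPattern()` is a hypothetical helper, the table of equivalent vowels is hand-written and covers only the monotonic tonos and dialytika forms, the word "μηχανή" is an illustrative example rather than the word from the report, and PHP 7 or later is assumed for the `??` operator.

```php
<?php
// Hypothetical helper: replaces each Greek vowel in $word with a character
// class holding its accented and unaccented forms, and escapes everything
// else, so the resulting /iu pattern ignores the accent.
function accentTolerantPattern(string $word): string
{
    // Each string lists variants that should be treated as the same letter.
    $groups = [
        'αάΑΆ', 'εέΕΈ', 'ηήΗΉ', 'ιίϊΐΙΊΪ',
        'οόΟΌ', 'υύϋΰΥΎΫ', 'ωώΩΏ',
    ];

    // Build a lookup from every variant to the class for its group.
    $classOf = [];
    foreach ($groups as $group) {
        foreach (preg_split('//u', $group, -1, PREG_SPLIT_NO_EMPTY) as $ch) {
            $classOf[$ch] = '[' . $group . ']';
        }
    }

    // Walk the word one UTF-8 character at a time.
    $body = '';
    foreach (preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY) as $ch) {
        $body .= $classOf[$ch] ?? preg_quote($ch, '/');
    }

    return '/' . $body . '/iu';
}

$pattern = accentTolerantPattern('μηχανή');

// Plain /iu fails on the unaccented capitals; the generated pattern matches.
$plainCount    = preg_match('/μηχανή/iu', 'ΜΗΧΑΝΗ');
$tolerantCount = preg_match($pattern, 'ΜΗΧΑΝΗ');

echo preg_replace($pattern, 'itworks', 'Η ΜΗΧΑΝΗ και η μηχανή'), "\n";
```

The generated pattern stays out of sight of the calling code, which keeps writing plain words. The trade-off is that the vowel table has to be maintained by hand and extended for any other equivalences the application cares about.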

[2009-03-09 17:20 UTC] mmcnickle at gmail dot com

It wouldn't be impossible, no. But to someone without detailed knowledge of Greek it would be. The unicode.org article on regular expressions [1] has this to say:

"All of the above deals with a default specification for a regular expression. However, a regular expression engine also may want to support tailored specifications, typically tailored for a particular language or locale. This may be important when the regular expression engine is being used by end-users instead of programmers, such as in a word-processor allowing some level of regular expressions in searching."

Earlier, the document says that basic regex engines are only required to support the default Unicode uppercase/lowercase matching.

Looking through the source code of the PCRE library, it does seem possible to generate locale-specific character tables; this may be an avenue to look into.

Perhaps the best thing to do would be to drop a message in the internationalization mailing list (http://marc.info/?l=php-i18n) and see what they have to say.

[1] http://unicode.org/reports/tr18/#Tailored_Support

[2009-03-12 09:39 UTC] sehh at ionos dot gr

Do you think it would be better if I contacted the developers of the PCRE library at http://www.pcre.org/ ?

Maybe submitting a patch or bug report to them would cover a lot more open source projects, instead of patching the PCRE library used by php only.

[2009-03-12 18:46 UTC] nlopess@php.net

not an issue in php. check the unicode standard.
