Bug #47480 preg_replace with "/i" is not case insensitive
Submitted: 2009-02-23 13:32 UTC Modified: 2009-03-12 18:46 UTC
Votes:4
Avg. Score:3.5 ± 1.7
Reproduced:3 of 4 (75.0%)
Same Version:2 (66.7%)
Same OS:3 (100.0%)
From: sehh at ionos dot gr Assigned:
Status: Not a bug Package: PCRE related
PHP Version: 5.2.8 OS: Linux
Private report: No CVE-ID: None

 [2009-02-23 13:32 UTC] sehh at ionos dot gr
Description:
------------
preg_replace with the "/i" (case insensitive search) does not do a case insensitive search for UTF-8 Greek characters, while it works fine for English characters.


Reproduce code:
---------------
<?php
$string = "?? ?????? ????? ??? ????????, ???? ??? ???????????? ???? ??????????"; // UTF-8 string in Greek language
$target1 = "????????"; // Target string to search for (capitalized)
$target2 = "????????"; // Target string to search for (small letters)
$replace = "itworks"; // Replace with this string

$rc = preg_replace("/$target1/imsUu", $replace, $string, -1, $counter); // Execute search for target1 and replace

echo "\nSearching for: ".$target1."\n"; // Report output
echo "Result string: ".$rc."\n";
echo "Found and replaced: ".$counter."\n";

$rc = preg_replace("/$target2/imsUu", $replace, $string, -1, $counter); // Execute search for target2 and replace

echo "\nSearching for: ".$target2."\n"; // Report output
echo "Result string: ".$rc."\n";
echo "Found and replaced: ".$counter."\n\n";
?>

Expected result:
----------------
I expect both "Found and replaced" counts to be 1, since the expression is case-insensitive.

Actual result:
--------------
$ php -f test.php 

Searching for: ????????
Result string: ?? ?????? ????? ??? ????????, ???? ??? ???????????? ???? ??????????
Found and replaced: 0

Searching for: ????????
Result string: ?? ?????? ????? ??? itworks, ???? ??? ???????????? ???? ??????????
Found and replaced: 1


 [2009-03-09 11:59 UTC] mmcnickle at gmail dot com
The test case is wrong and the bug should be closed. The upper case search target is misspelled.

$target1 = "????????";
$target2 = "????????";
should read
$target1 = "????????";
$target2 = "????????";

(Note the replacement of the second ? with a capital Thorn (U+00DE).)

With this change I get the expected result:

Actual Result
-------------

Searching for: ????????
Result string: ?? ?????? ????? ??? itworks, ???? ??? ???????????? ???? ??????????
Found and replaced: 1

Searching for: ????????
Result string: ?? ?????? ????? ??? itworks, ???? ??? ???????????? ???? ??????????
Found and replaced: 1
 [2009-03-09 12:16 UTC] sehh at ionos dot gr
Obviously you have no idea what you are talking about and obviously you don't speak Greek or know anything about the Greek language.

The word "????????" is capitalized as "????????".

What you are suggesting is like capitalizing the word "engine" as "ENGiNE".

Obviously, there is no word "ENGiNE", same way there is no word "????????" :)
 [2009-03-09 14:31 UTC] mmcnickle at gmail dot com
You're absolutely correct, I do not speak Greek. But neither does the PCRE library. It determines the uppercase/lowercase relationship between characters solely using Unicode properties.

The lowercase of ? is defined in Unicode as ? [1], not ?. Therefore the case-insensitive search will not match.

[1]http://www.fileformat.info/info/unicode/char/00c7/index.htm
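The behaviour described above can be reproduced in isolation. The following is a minimal sketch, not taken from the report: the three code points are illustrative choices, and the "\u{...}" escape syntax assumes PHP 7 or later (the report itself is against 5.2.8). It shows that under the "/iu" modifiers an accented lowercase letter matches its accented capital, but not the unaccented capital, because Unicode case mapping pairs only the first two.

```php
<?php
// Illustrative code points chosen for this sketch (not taken from the report).
$etaTonos    = "\u{03AE}"; // GREEK SMALL LETTER ETA WITH TONOS
$capEtaTonos = "\u{0389}"; // GREEK CAPITAL LETTER ETA WITH TONOS
$capEta      = "\u{0397}"; // GREEK CAPITAL LETTER ETA (no accent)

// Case-insensitive, UTF-8 aware pattern for the accented lowercase letter.
$pattern = '/' . $etaTonos . '/iu';

// The accented capital is the Unicode uppercase of the pattern letter.
$matchesAccentedCapital = preg_match($pattern, $capEtaTonos);

// The unaccented capital is a different letter as far as case mapping goes.
$matchesPlainCapital = preg_match($pattern, $capEta);

var_dump($matchesAccentedCapital); // int(1)
var_dump($matchesPlainCapital);    // int(0)
```

In other words, "/i" folds case only; it does not fold accents.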
 [2009-03-09 14:54 UTC] sehh at ionos dot gr
The PCRE library is wrong then.

"?" is correctly defined in Unicode as "?", but the library should also understand the meaning of "?" == "?" == "?".

This applies to all Greek accented letters:

"?" == "?" == "?"
etc...

Otherwise, the "/i" modifier is useless for the Greek language, and that's why the current implementation does not work for Greek.

Thank you for taking the time to look into this issue, much appreciated.
 [2009-03-09 15:00 UTC] sehh at ionos dot gr
I forgot the capital accented characters, so the above should read:

"?" == "?" == "?" == "?"
"?" == "?" == "?" == "?"
etc..

Remember that in Greek, the accent may be omitted from capital letters or may be included for the first letter only. So that should produce proper case-insensitive results.
 [2009-03-09 15:25 UTC] mmcnickle at gmail dot com
Yes, unfortunately trying to include locale and language specific cases is next to impossible for regular expression engine developers. 

The best that can be done, though far from ideal, is for the user to try to take these changes into account when they are crafting the regex:

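$target1 = "?????[??]??"; // Greek: a character class needs no "|" between alternatives

$target1 = "Stra(?:ss|ß)enbahn"; // German: "ss" is two characters, so it needs a group, not a class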
 [2009-03-09 16:01 UTC] sehh at ionos dot gr
Indeed, that's far from ideal. From my development point of view, it's impossible to rewrite every single accented character with its possible equivalents, for every string in every regex.

For example, this:
/???????? ?????????-????????/i

Would become a monster like this:
/????[???]?[???]? ???????[???]?-??????[???]?/i

We would need a regex to create the regex, or at least a text search/replace method in PHP!

Are you sure it's impossible to add a few exceptions within the PCRE library?
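Generating the pattern in PHP is at least feasible. The following is a rough sketch only: accentTolerantPattern() is a hypothetical helper (not part of PHP or PCRE), the sample word is an illustrative choice rather than the one from the report, and the vowel table covers only the seven tonos vowels, not dialytika or polytonic forms. It assumes the input is valid UTF-8.

```php
<?php
// Hypothetical helper: builds a pattern in which every Greek vowel matches
// its unaccented and tonos-accented forms, in either case.
function accentTolerantPattern(string $word): string
{
    // Each group lists one vowel: lowercase, lowercase with tonos,
    // capital, capital with tonos.
    $groups = ['αάΑΆ', 'εέΕΈ', 'ηήΗΉ', 'ιίΙΊ', 'οόΟΌ', 'υύΥΎ', 'ωώΩΏ'];

    // Map every listed character to the character class of its group.
    $classFor = [];
    foreach ($groups as $group) {
        foreach (preg_split('//u', $group, -1, PREG_SPLIT_NO_EMPTY) as $ch) {
            $classFor[$ch] = '[' . $group . ']';
        }
    }

    // Walk the word one UTF-8 character at a time; vowels become classes,
    // everything else is escaped and copied through.
    $pattern = '';
    foreach (preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY) as $ch) {
        $pattern .= $classFor[$ch] ?? preg_quote($ch, '/');
    }

    // "i" still handles the case of the consonants; "u" enables UTF-8 mode.
    return '/' . $pattern . '/iu';
}

$pattern = accentTolerantPattern('μηχανή');

$upperNoAccent = preg_match($pattern, 'ΜΗΧΑΝΗ'); // capitals, accent omitted
$titleCase     = preg_match($pattern, 'Μηχανή'); // accent kept
$otherWord     = preg_match($pattern, 'μηχανές'); // different ending

var_dump($upperNoAccent, $titleCase, $otherWord);
```

The search term is written once, in any form, and the helper expands it, so the "monster" pattern never has to be typed by hand. The trade-off is that the vowel table has to be maintained by whoever knows the language.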
 [2009-03-09 17:20 UTC] mmcnickle at gmail dot com
It wouldn't be impossible, no. But to someone without detailed knowledge of Greek it would be. The unicode.org article on regular expressions [1] has this to say:

"All of the above deals with a default specification for a regular expression. However, a regular expression engine also may want to support tailored specifications, typically tailored for a particular language or locale. This may be important when the regular expression engine is being used by end-users instead of programmers, such as in a word-processor allowing some level of regular expressions in searching."

Earlier in the document, it explains that basic regex engines are only required to support the default Unicode uppercase/lowercase matching.

Looking through the source code of the PCRE library, it does seem possible to generate locale-specific character tables; this may be an avenue to look into.

Perhaps the best thing to do would be to drop a message in the internationalization mailing list (http://marc.info/?l=php-i18n) and see what they have to say.

[1] http://unicode.org/reports/tr18/#Tailored_Support
 [2009-03-12 09:39 UTC] sehh at ionos dot gr
Do you think it would be better if I contacted the developers of the PCRE library at http://www.pcre.org/ ?

Maybe submitting a patch or bug report to them would cover a lot more open source projects, instead of patching the PCRE library used by php only.
 [2009-03-12 18:46 UTC] nlopess@php.net
not an issue in php. check the unicode standard.
 