Bug #61860 Offsets may be wrong for grapheme_stri* functions
Submitted: 2012-04-26 20:18 UTC Modified: 2013-06-24 06:30 UTC
From: poinsot dot julien at gmail dot com Assigned: stas
Status: Closed Package: intl (PECL)
PHP Version: Irrelevant OS:
Private report: No CVE-ID:
 [2012-04-26 20:18 UTC] poinsot dot julien at gmail dot com
Description:
------------
I don't know whether this really qualifies as a bug: full case folding can produce wrong offsets for the few code points that expand to more than one code point (up to 3). For example, "ß" folds to "ss", so the folded string no longer has the same length as the original, and the grapheme_stri* functions may return results that differ from what the user expects.

A simple workaround could be to use simple case folding instead, even though it covers fewer cases.
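
To illustrate the length change described above, here is a minimal sketch that uses ext/mbstring rather than the intl code path the report is about. It assumes PHP 7.3 or later, where the `MB_CASE_FOLD` and `MB_CASE_FOLD_SIMPLE` modes are available. The shortened haystack and the variable names are only for illustration.

```php
<?php
// Sketch only: requires ext/mbstring on PHP 7.3 or later.
$haystack = 'Auf der Straße nach Paris';

// Full case folding expands "ß" to "ss"; simple folding leaves it as is.
$full   = mb_convert_case($haystack, MB_CASE_FOLD, 'UTF-8');
$simple = mb_convert_case($haystack, MB_CASE_FOLD_SIMPLE, 'UTF-8');

// Offsets of the needle inside each folded string, counted in code points.
$posFull   = mb_strpos($full, 'paris', 0, 'UTF-8');
$posSimple = mb_strpos($simple, 'paris', 0, 'UTF-8');

var_dump($posFull, $posSimple);

// An offset computed on the fully folded string is one too large
// when it is applied to the original string.
var_dump(mb_substr($haystack, $posFull, null, 'UTF-8'));
var_dump(mb_substr($haystack, $posSimple, null, 'UTF-8'));
```

With full folding the offset is 21 and the substring starts at "aris"; with simple folding the offset is 20 and the substring starts at "Paris".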

Test script:
---------------
$haystack = 'Auf der Straße nach Paris habe ich mit dem Fahrer gesprochen';
var_dump(
    grapheme_stristr($haystack, 'Paris '),
    grapheme_substr($haystack, grapheme_stripos($haystack, 'Paris'))
);

Expected result:
----------------
string(40) "Paris habe ich mit dem Fahrer gesprochen"
string(40) "Paris habe ich mit dem Fahrer gesprochen"

Actual result:
--------------
string(39) "aris habe ich mit dem Fahrer gesprochen"
string(39) "aris habe ich mit dem Fahrer gesprochen"

Patches

grapheme_util.c (last revision 2012-04-26 20:19 UTC, by poinsot dot julien at gmail dot com)


History

 [2012-04-27 07:46 UTC] cataphract@php.net
-Assigned To: +Assigned To: stas
 [2013-05-27 08:29 UTC] okin7 at yahoo dot fr
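Btw, here is an mbstring-based PHP implementation that should do the job correctly:

    // Intended as replacements for the intl functions of the same name.
    function grapheme_stripos($s, $needle, $offset = 0)
    {
        if ($offset < 0) $offset = 0;
        // mb_stripos() returns a code point offset, or false when there is no match.
        $pos = mb_stripos($s, $needle, $offset, 'UTF-8');
        if (false === $pos) return false;
        // Convert the code point offset into a grapheme offset.
        return grapheme_strlen(mb_substr($s, 0, $pos, 'UTF-8'));
    }

    function grapheme_strripos($s, $needle, $offset = 0)
    {
        if ($offset < 0) $offset = 0;
        // Same approach, searching for the last occurrence.
        $pos = mb_strripos($s, $needle, $offset, 'UTF-8');
        if (false === $pos) return false;
        return grapheme_strlen(mb_substr($s, 0, $pos, 'UTF-8'));
    }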
 [2013-06-24 06:30 UTC] stas@php.net
-Status: Assigned +Status: Feedback
 [2013-06-24 06:30 UTC] stas@php.net
Could you please check the attached pull request and see if it fixes your issues?
 [2013-06-28 20:56 UTC] stas@php.net
Automatic comment on behalf of stas
Revision: http://git.php.net/?p=php-src.git;a=commit;h=8aba119f5525663ab202e459929d7fb3271aef51
Log: fix bug #61860: use USearch for searches, it does the right thing
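
For anyone who wants to confirm the fix locally, the following is a small check built from the test script in the report. It is a sketch that assumes ext/intl from a build that includes the commit above. The lowercase needle and the variable names are illustrative choices, not part of the original script.

```php
<?php
// Sketch only: requires ext/intl from a build that includes the fix.
$haystack = 'Auf der Straße nach Paris habe ich mit dem Fahrer gesprochen';

// Offset of the case-insensitive match, counted in graphemes.
$pos = grapheme_stripos($haystack, 'paris');

// Both calls should return the same tail, starting at "Paris".
$tailFromStristr = grapheme_stristr($haystack, 'paris');
$tailFromSubstr  = grapheme_substr($haystack, $pos);

var_dump($pos, $tailFromStristr, $tailFromSubstr);
```

On a fixed build the offset should be 20 and both tails should match the expected result from the report.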
 [2013-06-28 20:56 UTC] stas@php.net
-Status: Feedback +Status: Closed
 [2013-06-28 21:21 UTC] poinsot dot julien at gmail dot com
It seems ok.

Besides, it should work, if we consider UTS #10:

Ideally, the UCA at a secondary level would be compatible with the standard Unicode case folding and removal of compatibility differences, especially for the purpose of matching. For the vast majority of characters, it is compatible, but there are few exceptions.

Thanks.
 