Bug #61860 Offsets may be wrong for grapheme_stri* functions
Submitted: 2012-04-26 20:18 UTC Modified: 2013-06-24 06:30 UTC
Avg. Score:2.0 ± 0.0
Reproduced:1 of 1 (100.0%)
Same Version:0 (0.0%)
Same OS:0 (0.0%)
From: poinsot dot julien at gmail dot com Assigned: stas
Status: Closed Package: intl (PECL)
PHP Version: Irrelevant OS:
Private report: No CVE-ID:
 [2012-04-26 20:18 UTC] poinsot dot julien at gmail dot com
I don't know if this really qualifies as a bug: full case folding can produce wrong offsets for the few code points that expand to more than one code point (up to 3). For example, "ß" is folded to "ss": the length is no longer the same, so grapheme_stri* functions may return results that differ from what the user expects.

A simple "workaround" could be to use simple case folding instead, even though it is more limited.
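
To illustrate the effect, here is a minimal sketch that uses mbstring rather than the intl internals. It assumes PHP 7.3 or later, where the MB_CASE_FOLD and MB_CASE_FOLD_SIMPLE constants are available. The shortened haystack and the variable names are only illustrative. It shows how an offset found in a fully case-folded copy no longer lines up with the original string.

```php
<?php
// Requires the mbstring extension and PHP >= 7.3 (MB_CASE_FOLD constants).
$haystack = 'Auf der Straße nach Paris';

// Full case folding expands "ß" to "ss"; simple case folding keeps one code point.
$full   = mb_convert_case($haystack, MB_CASE_FOLD, 'UTF-8');
$simple = mb_convert_case($haystack, MB_CASE_FOLD_SIMPLE, 'UTF-8');

// Position of the (already folded) needle in each folded copy, in code points.
$posFull   = mb_strpos($full, 'paris', 0, 'UTF-8');
$posSimple = mb_strpos($simple, 'paris', 0, 'UTF-8');

// Applying the offset from the fully folded copy to the original is off by one.
echo mb_substr($haystack, $posFull, null, 'UTF-8'), "\n";   // aris
echo mb_substr($haystack, $posSimple, null, 'UTF-8'), "\n"; // Paris
```

With simple case folding the folded copy keeps the same number of code points, so the offset can be reused on the original string.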

Test script:
$haystack = 'Auf der Straße nach Paris habe ich mit dem Fahrer gesprochen';
var_dump(
    grapheme_stristr($haystack, 'Paris '),
    grapheme_substr($haystack, grapheme_stripos($haystack, 'Paris'))
);

Expected result:
string(40) "Paris habe ich mit dem Fahrer gesprochen"
string(40) "Paris habe ich mit dem Fahrer gesprochen"

Actual result:
string(39) "aris habe ich mit dem Fahrer gesprochen"
string(39) "aris habe ich mit dem Fahrer gesprochen"


Patches: grapheme_util.c (last revision 2012-04-26 20:19 UTC, by poinsot dot julien at gmail dot com)

 [2012-04-27 07:46 UTC]
-Assigned To: +Assigned To: stas
 [2013-05-27 08:29 UTC] okin7 at yahoo dot fr
Btw, here is an mbstring-based PHP implementation that should do the job correctly:

    function grapheme_stripos($s, $needle, $offset = 0)
    {
        if ($offset < 0) $offset = 0;
        // $needle is reused for the code point offset (0 or false is returned as is)
        if (!$needle = mb_stripos($s, $needle, $offset, 'UTF-8')) return $needle;
        return grapheme_strlen(mb_substr($s, 0, $needle, 'UTF-8'));
    }

    function grapheme_strripos($s, $needle, $offset = 0)
    {
        if ($offset < 0) $offset = 0;
        // $needle is reused for the code point offset (0 or false is returned as is)
        if (!$needle = mb_strripos($s, $needle, $offset, 'UTF-8')) return $needle;
        return grapheme_strlen(mb_substr($s, 0, $needle, 'UTF-8'));
    }
 [2013-06-24 06:30 UTC]
-Status: Assigned +Status: Feedback
 [2013-06-24 06:30 UTC]
Could you please check the attached pull request and see if it fixes your issues?
 [2013-06-28 20:56 UTC]
Automatic comment on behalf of stas
Log: fix bug #61860: use USearch for searches, it does the right thing
 [2013-06-28 20:56 UTC]
-Status: Feedback +Status: Closed
 [2013-06-28 21:21 UTC] poinsot dot julien at gmail dot com
It seems ok.

Besides, it should work, if we consider UTS #10:

"Ideally, the UCA at a secondary level would be compatible with the standard Unicode case folding and removal of compatibility differences, especially for the purpose of matching. For the vast majority of characters, it is compatible, but there are few exceptions."
