Bug #68447 grapheme_extract take an extra trailing character
Submitted: 2014-11-19 05:38 UTC Modified: -
Votes:4
Avg. Score:3.0 ± 0.7
Reproduced:2 of 2 (100.0%)
Same Version:0 (0.0%)
Same OS:0 (0.0%)
From: masakielastic at gmail dot com Assigned:
Status: Closed Package: intl (PECL)
PHP Version: 5.6.3 OS: Mac OS X
Private report: No CVE-ID:
 [2014-11-19 05:38 UTC] masakielastic at gmail dot com
Description:
------------
grapheme_extract() returns an extra trailing character when the string contains a character from the Variation Selectors Supplement block (U+E0100...U+E01EF). These variation selectors are used for last names and place names in Japanese.

https://en.wikipedia.org/Variation_Selectors_Supplement_(Unicode_block)

The test data, Katsushika-ku ("葛\xF3\xA0\x84\x81飾区" - U+845B U+E0101 U+98FE U+533A), is a place name in Tokyo, Japan.

https://en.wikipedia.org/wiki/Katsushika

One of the famous companies in Katsushika-ku is Tomy Company, Ltd.

https://en.wikipedia.org/Tomy

You can see the glyph variant for U+845B U+E0101 at the following site.

http://ivd.dicey.org/17408

Test script:
---------------
var_dump(grapheme_extract("葛\xF3\xA0\x84\x81飾区", 1));

Expected result:
----------------
"葛\xF3\xA0\x84\x81"

Actual result:
--------------
"葛\xF3\xA0\x84\x81飾"
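
As an independent cross-check of the expected result, the first grapheme cluster can also be taken with PCRE's \X escape instead of the intl extension. This is only a sketch: first_grapheme_pcre() is a hypothetical helper name, not an existing PHP function, and it assumes a PCRE build with Unicode support (the /u modifier).

```php
<?php
// Hypothetical helper for cross-checking; not part of ext/intl.
function first_grapheme_pcre($s)
{
    // \X matches one extended grapheme cluster; /u treats the subject as UTF-8.
    // preg_match() returns 0 for an empty string and false for invalid UTF-8,
    // so both of those cases fall through to the empty string.
    return preg_match('/^\X/u', $s, $m) === 1 ? $m[0] : '';
}

// Same test data as in the report: U+845B U+E0101 U+98FE U+533A.
$katsushika = "葛\xF3\xA0\x84\x81飾区";

// Print the bytes in hex so the invisible variation selector is visible.
echo bin2hex(first_grapheme_pcre($katsushika)), "\n"; // e8919bf3a08481
```

The hex output is the three bytes of U+845B followed by the four bytes of U+E0101, which matches the expected result above.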

 [2016-07-12 09:19 UTC] kentaro at ranvis dot com
PR has been added.
This bug occurs with any character above U+FFFF, and the returned value can include more than one extraneous character.

<?php

$latin = 'AAABBBBB';
var_dump(mb_strlen(grapheme_extract($latin, 3))); // 3

$latin = '🄐🄐🄐🄑🄑🄑🄑🄑';
var_dump(mb_strlen(grapheme_extract($latin, 3))); // should be 3, got 6

$emoticon = '😃';
var_dump(grapheme_extract($emoticon, 1, GRAPHEME_EXTR_MAXCHARS)); /* should be '😃', got '' */

$emoticon = '😃😟';
var_dump(grapheme_extract($emoticon, 4, GRAPHEME_EXTR_MAXBYTES)); /* should be '😃', got '' */

$k1024 = '𞣇𞣓𞣈𞣑𞣊';
if (grapheme_strlen($k1024) == 3) {
    var_dump(grapheme_extract($k1024, 1)); /* should be '𞣇𞣓', got '𞣇𞣓𞣈𞣑' */
} else {
    echo "ICU doesn't support Unicode 7.0\n";
}
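
The report does not state the root cause. As a side illustration of why U+FFFF is a natural boundary, the sketch below counts the same string in code points, UTF-16 code units and UTF-8 bytes. unit_counts() is a hypothetical helper name, and the mbstring extension is assumed to be loaded. A character above U+FFFF is one code point but two UTF-16 code units, which is consistent with the "should be 3, got 6" result above, although it does not by itself prove where the miscount happens.

```php
<?php
// Hypothetical helper for illustration only; requires ext/mbstring.
function unit_counts($s)
{
    return [
        // Number of Unicode code points.
        'code_points' => mb_strlen($s, 'UTF-8'),
        // UTF-16 code units: two bytes each, surrogate pairs count as two units.
        'utf16_units' => strlen(mb_convert_encoding($s, 'UTF-16LE', 'UTF-8')) / 2,
        // Raw length of the UTF-8 encoded string in bytes.
        'utf8_bytes'  => strlen($s),
    ];
}

print_r(unit_counts('A'));  // code_points 1, utf16_units 1, utf8_bytes 1
print_r(unit_counts('🄐')); // code_points 1, utf16_units 2, utf8_bytes 4
```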
 [2016-11-27 23:38 UTC] stas@php.net
Automatic comment on behalf of kentaro@ranvis.com
Revision: http://git.php.net/?p=php-src.git;a=commit;h=df683fa3b0a66d7626d6720858f309da9e7880e5
Log: Fix #68447: grapheme_extract take an extra trailing character
 [2016-11-27 23:38 UTC] stas@php.net
-Status: Open +Status: Closed
 [2016-11-30 23:14 UTC] davey@php.net
Automatic comment on behalf of kentaro@ranvis.com
Revision: http://git.php.net/?p=php-src.git;a=commit;h=df683fa3b0a66d7626d6720858f309da9e7880e5
Log: Fix #68447: grapheme_extract take an extra trailing character
 
Last updated: Wed Apr 26 17:01:37 2017 UTC