Bug #68447 grapheme_extract take an extra trailing character
Submitted: 2014-11-19 05:38 UTC Modified: -
Votes:4
Avg. Score:3.0 ± 0.7
Reproduced:2 of 2 (100.0%)
Same Version:0 (0.0%)
Same OS:0 (0.0%)
From: masakielastic at gmail dot com Assigned:
Status: Closed Package: intl (PECL)
PHP Version: 5.6.3 OS: Mac OS X
Private report: No CVE-ID: None
 [2014-11-19 05:38 UTC] masakielastic at gmail dot com
Description:
------------
grapheme_extract() takes an extra trailing character when the string contains a character from the Variation Selectors Supplement block (U+E0100..U+E01EF). Variation selectors from this block are used for glyph variants in Japanese family names and place names.

https://en.wikipedia.org/Variation_Selectors_Supplement_(Unicode_block)

The test data, Katsushika-ku ("葛\xF3\xA0\x84\x81飾区" - U+845B U+E0101 U+98FE U+533A), is a place name in Tokyo, Japan.

https://en.wikipedia.org/wiki/Katsushika

One of the famous companies in Katsushika-ku is Tomy Company, Ltd.

https://en.wikipedia.org/Tomy

You can see the glyph variant for U+845B U+E0101 at the following site.

http://ivd.dicey.org/17408

Test script:
---------------
var_dump(grapheme_extract("葛\xF3\xA0\x84\x81飾区", 1));

Expected result:
----------------
"葛\xF3\xA0\x84\x81"

Actual result:
--------------
"葛\xF3\xA0\x84\x81飾"
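On affected versions, one possible stopgap is to take the first cluster with PCRE's `\X` escape instead of grapheme_extract(). This is only a sketch: `first_grapheme()` is a hypothetical helper name, it assumes PCRE was built with Unicode support, and PCRE's cluster rules are not guaranteed to match ICU's in every case.

```php
<?php
// Hypothetical helper for illustration; not part of the intl extension.
function first_grapheme(string $utf8): string
{
    // \X matches one extended grapheme cluster; the /u flag enables UTF-8 mode.
    if (preg_match('/\X/u', $utf8, $m) === 1) {
        return $m[0];
    }
    return ''; // empty input or invalid UTF-8
}

$first = first_grapheme("葛\xF3\xA0\x84\x81飾区");

// Compare against the expected result from this report.
var_dump($first === "葛\xF3\xA0\x84\x81");
```

The variation selector U+E0101 has the Extend grapheme break property, so it stays attached to the preceding base character.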

History

 [2016-07-12 09:19 UTC] kentaro at ranvis dot com
PR has been added.
This bug occurs with any character above U+FFFF, and the returned value can include more than one extraneous character.

<?php

$latin = 'AAABBBBB';
var_dump(mb_strlen(grapheme_extract($latin, 3))); // 3

$latin = '????????';
var_dump(mb_strlen(grapheme_extract($latin, 3))); // should be 3, got 6

$emoticon = '?';
var_dump(grapheme_extract($emoticon, 1, GRAPHEME_EXTR_MAXCHARS)); /* should be '?', got '' */

$emoticon = '??';
var_dump(grapheme_extract($emoticon, 4, GRAPHEME_EXTR_MAXBYTES)); /* should be '?', got '' */

$k1024 = '?????';
if (grapheme_strlen($k1024) == 3) {
    var_dump(grapheme_extract($k1024, 1)); /* should be '??', got '????' */
} else {
    echo "ICU doesn't support Unicode 7.0\n";
}
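
The symptoms are consistent with a cluster boundary measured in UTF-16 code units being applied as if it were a count of code points. That explanation is an assumption, not something this report confirms. The sketch below only models the arithmetic in plain PHP; `utf16_units()` and `take_code_points()` are hypothetical helpers, not intl functions.

```php
<?php
// Hypothetical helpers for illustration; not part of the intl extension.

// Count the UTF-16 code units needed for a valid UTF-8 string.
function utf16_units(string $utf8): int
{
    $units = 0;
    $len = strlen($utf8);
    for ($i = 0; $i < $len; $i++) {
        $byte = ord($utf8[$i]);
        if (($byte & 0xC0) === 0x80) {
            continue; // continuation byte, counted with its lead byte
        }
        // A lead byte >= 0xF0 starts a 4-byte sequence, i.e. a code point
        // above U+FFFF, which needs a surrogate pair (two units) in UTF-16.
        $units += ($byte >= 0xF0) ? 2 : 1;
    }
    return $units;
}

// Return the first $n code points of a valid UTF-8 string.
function take_code_points(string $utf8, int $n): string
{
    preg_match('/^.{0,' . $n . '}/us', $utf8, $m);
    return $m[0];
}

$input = "葛\xF3\xA0\x84\x81飾区";
$firstCluster = "葛\xF3\xA0\x84\x81"; // U+845B U+E0101

// Boundary of the first cluster as a UTF-16 offset: 1 unit + 2 units.
$utf16Offset = utf16_units($firstCluster);

// Assumed faulty step: treating that offset as a number of code points.
$wrong = take_code_points($input, $utf16Offset);

var_dump($utf16Offset);
var_dump($wrong === "葛\xF3\xA0\x84\x81飾");
```

Under this assumption, each character above U+FFFF inside the extracted span adds one extra code point to the result, which would match the observation that more than one extraneous character can be returned.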
 [2016-11-27 23:38 UTC] stas@php.net
Automatic comment on behalf of kentaro@ranvis.com
Revision: http://git.php.net/?p=php-src.git;a=commit;h=df683fa3b0a66d7626d6720858f309da9e7880e5
Log: Fix #68447: grapheme_extract take an extra trailing character
 [2016-11-27 23:38 UTC] stas@php.net
-Status: Open +Status: Closed
 [2016-11-30 23:14 UTC] davey@php.net
Automatic comment on behalf of kentaro@ranvis.com
Revision: http://git.php.net/?p=php-src.git;a=commit;h=df683fa3b0a66d7626d6720858f309da9e7880e5
Log: Fix #68447: grapheme_extract take an extra trailing character