Bug #65732 grapheme_*() is not Unicode compliant on CR LF sequence
Submitted: 2013-09-21 17:40 UTC Modified: 2016-08-20 01:23 UTC
Votes:1
Avg. Score:4.0 ± 0.0
Reproduced:1 of 1 (100.0%)
Same Version:1 (100.0%)
Same OS:0 (0.0%)
From: poinsot dot julien at gmail dot com Assigned: cmb (profile)
Status: Closed Package: intl (PECL)
PHP Version: Irrelevant OS:
Private report: No CVE-ID: None
 [2013-09-21 17:40 UTC] poinsot dot julien at gmail dot com
Description:
------------
The ASCII optimisation in the grapheme_*() functions counts a CR + LF sequence as 2 distinct graphemes, but it should count as 1, as defined by the Unicode standard.

This gives inconsistent results depending on whether the string contains a code point > 0x7F: the ASCII-only string below yields 3, while the other yields 2. E.g.:
grapheme_strlen("\r\na") != grapheme_strlen("\r\né")

A workaround for now could be to append an invisible or whitespace character to the string, like:

function my_grapheme_strlen($string) {
    return grapheme_strlen($string . "\xef\xbb\xbf") - 1; // append ZERO WIDTH NO-BREAK SPACE (U+FEFF) to force the non-ASCII code path
}
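
As a cross-check that does not rely on the intl extension, the clusters can also be counted with PCRE's \X escape, which matches one extended grapheme cluster and does not break between CR and LF. This is only a sketch: pcre_grapheme_strlen() is a hypothetical helper name, and it assumes valid UTF-8 input and a PCRE build with Unicode support.

```php
<?php
// Hypothetical helper (not part of intl): count extended grapheme clusters via PCRE's \X.
function pcre_grapheme_strlen(string $string): ?int {
    // preg_match_all() returns the number of full matches, or false on error
    // (for example, when $string is not valid UTF-8 under the /u modifier).
    $count = preg_match_all('/\X/u', $string);
    return $count === false ? null : $count;
}

var_dump(pcre_grapheme_strlen("\r\n"));
var_dump(pcre_grapheme_strlen("\r\na") === pcre_grapheme_strlen("\r\né"));
```

Unlike the append-a-character workaround, this never calls grapheme_strlen(), so it can be compared against it to spot the discrepancy.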

Test script:
---------------
var_dump(grapheme_strlen("\r\n"));
var_dump(grapheme_substr(implode("\r\n", ['abc', 'def', 'ghi']), 5));

Expected result:
----------------
int(1)
string(7) "ef
ghi"

Actual result:
--------------
int(2)
string(8) "def
ghi"

History

 [2016-08-19 16:18 UTC] cmb@php.net
-Status: Open +Status: Analyzed -Assigned To: +Assigned To: cmb
 [2016-08-19 16:18 UTC] cmb@php.net
Indeed, grapheme_ascii_check() may return the wrong length
(see <https://3v4l.org/sBmOS>), and it needs special-casing
for CRLF.
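
The special case mentioned above can be illustrated in userland PHP. This is a sketch of the idea only, not the change made in php-src; ascii_grapheme_strlen() is a hypothetical name. It relies on the fact that in a pure-ASCII string, CR LF is the only sequence of more than one byte that forms a single grapheme cluster.

```php
<?php
// Hypothetical userland sketch of an ASCII fast path that treats CR LF as one cluster.
// Returns null for non-ASCII input, where a full Unicode break iterator is needed.
function ascii_grapheme_strlen(string $string): ?int {
    if (preg_match('/[\x80-\xFF]/', $string)) {
        return null; // not pure ASCII: caller should fall back to grapheme_strlen()
    }
    // Every ASCII byte is its own cluster, except that each "\r\n" pair counts once.
    return strlen($string) - substr_count($string, "\r\n");
}

var_dump(ascii_grapheme_strlen("\r\n"));              // int(1)
var_dump(ascii_grapheme_strlen("abc\r\ndef\r\nghi")); // int(11)
```

A lone CR or LF, or the reversed pair LF CR, still counts as separate clusters, which matches the Unicode rule that only CR followed by LF is kept together.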
 [2016-08-19 17:06 UTC] cmb@php.net
-Summary: grapheme_*: ASCII optimisation is not Unicode compliant on CR LF sequence +Summary: grapheme_*() is not Unicode compliant on CR LF sequence
 [2016-08-20 01:22 UTC] cmb@php.net
Automatic comment on behalf of cmbecker69@gmx.de
Revision: http://git.php.net/?p=php-src.git;a=commit;h=e4a006cd3e17338677ec269a8cdb1354f38e0cad
Log: Fix #65732: grapheme_*() is not Unicode compliant on CR LF sequence
 [2016-08-20 01:22 UTC] cmb@php.net
-Status: Analyzed +Status: Closed
 [2016-10-17 10:09 UTC] bwoebi@php.net
Automatic comment on behalf of cmbecker69@gmx.de
Revision: http://git.php.net/?p=php-src.git;a=commit;h=e4a006cd3e17338677ec269a8cdb1354f38e0cad
Log: Fix #65732: grapheme_*() is not Unicode compliant on CR LF sequence
 