Bug #65732 grapheme_*() is not Unicode compliant on CR LF sequence
Submitted: 2013-09-21 17:40 UTC Modified: 2016-08-20 01:23 UTC
Votes:1
Avg. Score:4.0 ± 0.0
Reproduced:1 of 1 (100.0%)
Same Version:1 (100.0%)
Same OS:0 (0.0%)
From: poinsot dot julien at gmail dot com
Assigned: cmb
Status: Closed
Package: intl (PECL)
PHP Version: Irrelevant
OS:
Private report: No
CVE-ID: None

 

 [2013-09-21 17:40 UTC] poinsot dot julien at gmail dot com
Description:
------------
The ASCII optimisation in the grapheme_* functions counts a CR + LF sequence as 2 distinct graphemes, but it should count as 1, as defined by the Unicode standard.

This can lead to inconsistent results depending on whether the string contains a code point > 0x7F. E.g.:
grapheme_strlen("\r\na") != grapheme_strlen("\r\né")

A workaround for now could be to append an invisible or whitespace character to the string, like:

function my_grapheme_strlen($string) {
    return grapheme_strlen($string . "\xef\xbb\xbf") - 1; // append U+FEFF (ZERO WIDTH NO-BREAK SPACE) so the ASCII path is skipped
}

Test script:
---------------
var_dump(grapheme_strlen("\r\n"));
var_dump(grapheme_substr(implode("\r\n", ['abc', 'def', 'ghi']), 5));

Expected result:
----------------
int(1)
string(7) "ef
ghi"

Actual result:
--------------
int(2)
string(8) "def
ghi"
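
The expected values above can be cross-checked without intl by splitting on PCRE's \X, which matches one extended grapheme cluster and does not break between CR and LF (assuming PCRE 8.32 or later, or PCRE2). This is only a sketch: the helper names pcre_graphemes, pcre_grapheme_strlen and pcre_grapheme_substr are hypothetical and are not part of any extension.

```php
<?php
// Hypothetical helpers for illustration only; they are not part of intl.
// They rely on PCRE's \X (extended grapheme cluster), which keeps CR LF together.

function pcre_graphemes(string $s): array
{
    // preg_match_all() returns false on invalid UTF-8; this sketch treats
    // that case as "no clusters" instead of reporting an error.
    if (preg_match_all('/\X/u', $s, $m) === false) {
        return [];
    }
    return $m[0];
}

function pcre_grapheme_strlen(string $s): int
{
    return count(pcre_graphemes($s));
}

function pcre_grapheme_substr(string $s, int $start): string
{
    // Only non-negative start offsets are handled in this sketch.
    return implode('', array_slice(pcre_graphemes($s), $start));
}

var_dump(pcre_grapheme_strlen("\r\n"));
var_dump(pcre_grapheme_substr(implode("\r\n", ['abc', 'def', 'ghi']), 5));
```

With CR LF treated as a single cluster, offset 5 lands on "e" rather than "d", which is where the expected and actual results diverge.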

History

 [2016-08-19 16:18 UTC] cmb@php.net
-Status: Open
+Status: Analyzed
-Assigned To:
+Assigned To: cmb
 [2016-08-19 16:18 UTC] cmb@php.net
Indeed, grapheme_ascii_check() can return the wrong
length (see <https://3v4l.org/sBmOS>) and needs special-casing
for CRLF.
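
One way such special-casing could look is sketched below in PHP. This is not the committed C change: ascii_grapheme_count is a hypothetical name, and the -1 return value as a "fall back to full Unicode segmentation" signal is an assumption made for this illustration.

```php
<?php
// Sketch of an ASCII fast path that special-cases CR LF.
// ascii_grapheme_count() is a hypothetical name, not the C function from intl.
// It returns the grapheme count for pure-ASCII input, or -1 (an assumed
// convention) to tell the caller to use full Unicode segmentation instead.

function ascii_grapheme_count(string $s): int
{
    $count = 0;
    $len = strlen($s);
    for ($i = 0; $i < $len; $i++) {
        if (ord($s[$i]) > 0x7F) {
            return -1; // non-ASCII byte: the fast path does not apply
        }
        // CR immediately followed by LF forms a single grapheme cluster,
        // so skip the LF and count the pair once.
        if ($s[$i] === "\r" && $i + 1 < $len && $s[$i + 1] === "\n") {
            $i++;
        }
        $count++;
    }
    return $count;
}
```

A lone CR, a lone LF, or LF followed by CR still counts one grapheme per byte; only the exact CR LF pair is merged.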
 [2016-08-19 17:06 UTC] cmb@php.net
-Summary: grapheme_*: ASCII optimisation is not Unicode compliant on CR LF sequence
+Summary: grapheme_*() is not Unicode compliant on CR LF sequence
 [2016-08-20 01:22 UTC] cmb@php.net
Automatic comment on behalf of cmbecker69@gmx.de
Revision: http://git.php.net/?p=php-src.git;a=commit;h=e4a006cd3e17338677ec269a8cdb1354f38e0cad
Log: Fix #65732: grapheme_*() is not Unicode compliant on CR LF sequence
 [2016-08-20 01:22 UTC] cmb@php.net
-Status: Analyzed +Status: Closed
 [2016-10-17 10:09 UTC] bwoebi@php.net
Automatic comment on behalf of cmbecker69@gmx.de
Revision: http://git.php.net/?p=php-src.git;a=commit;h=e4a006cd3e17338677ec269a8cdb1354f38e0cad
Log: Fix #65732: grapheme_*() is not Unicode compliant on CR LF sequence
 