PHP :: Bug #65732 :: grapheme_*() is not Unicode compliant on CR LF sequence

Submitted: 2013-09-21 17:40 UTC
Modified: 2016-08-20 01:23 UTC

Votes:	1
Avg. Score:	4.0 ± 0.0
Reproduced:	1 of 1 (100.0%)
Same Version:	1 (100.0%)
Same OS:	0 (0.0%)

From: poinsot dot julien at gmail dot com
Assigned: cmb
Status: Closed
Package: intl (PECL)
PHP Version: Irrelevant
OS:
Private report:
CVE-ID: None

[2013-09-21 17:40 UTC] poinsot dot julien at gmail dot com

Description:
------------
The ASCII optimisation in the grapheme_* functions counts a CR LF sequence as two distinct graphemes, but it should count as one, as defined by the Unicode standard (UAX #29).

This can lead to inconsistent results depending on whether the string contains a code point > 0x7F. E.g.:
grapheme_strlen("\r\na") != grapheme_strlen("\r\né")

A workaround for now is to append an invisible non-ASCII character to the string, so that the ASCII optimisation is skipped, e.g.:

function my_grapheme_strlen($string) {
    // Append U+FEFF (ZERO WIDTH NO-BREAK SPACE, bytes EF BB BF), then discount it.
    return grapheme_strlen($string . "\xef\xbb\xbf") - 1;
}

Test script:
---------------
var_dump(grapheme_strlen("\r\n"));
var_dump(grapheme_substr(implode("\r\n", ['abc', 'def', 'ghi']), 5));

Expected result:
----------------
int(1)
string(7) "ef
ghi"

Actual result:
--------------
int(2)
string(8) "def
ghi"
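
To see where the expected values come from, the following sketch splits an ASCII-only string into units while keeping each CR LF pair together. The helper name ascii_graphemes() is hypothetical and is not part of the intl extension; it only illustrates the counting rule and does not handle non-ASCII input.

```php
<?php
// Hypothetical helper (not part of intl): split an ASCII-only string
// into grapheme units, treating each CR LF pair as a single unit.
function ascii_graphemes(string $s): array
{
    // Alternation order matters: try CR LF first, then any single ASCII byte.
    preg_match_all('/\r\n|[\x00-\x7F]/', $s, $m);
    return $m[0];
}

$units = ascii_graphemes(implode("\r\n", ['abc', 'def', 'ghi']));

// a, b, c, CRLF, d, e, f, CRLF, g, h, i: eleven units in total.
var_dump(count(ascii_graphemes("\r\n")));        // int(1)
var_dump(implode('', array_slice($units, 5)));   // string(7) "ef\r\nghi"
```

Offset 5 skips "abc", the first CR LF and "d", which leaves the 7-byte tail shown under "Expected result".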

History

[2016-08-19 16:18 UTC] cmb@php.net

-Status: Open
+Status: Analyzed
-Assigned To:
+Assigned To: cmb

[2016-08-19 16:18 UTC] cmb@php.net

Indeed, grapheme_ascii_check() may obviously return the wrong
length, see <https://3v4l.org/sBmOS>, and needs a special casing
for CRLF.
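
As a rough illustration of what such special casing could look like, here is a userland sketch in PHP. It is not the code from php-src (the real grapheme_ascii_check() is written in C and its exact logic is not shown in this report); the function name ascii_grapheme_count() and its null return convention are assumptions made for this example.

```php
<?php
// Hypothetical userland sketch; NOT the actual php-src implementation.
// Returns the grapheme count for a pure-ASCII string, or null when the
// string contains a byte > 0x7F and the full Unicode path is required.
function ascii_grapheme_count(string $s): ?int
{
    if (preg_match('/[\x80-\xFF]/', $s) === 1) {
        return null; // not pure ASCII: the fast path does not apply
    }
    // Each ASCII byte is one grapheme, except that every CR LF pair
    // forms a single cluster, so subtract one per pair.
    return strlen($s) - substr_count($s, "\r\n");
}

var_dump(ascii_grapheme_count("\r\n"));   // int(1)
var_dump(ascii_grapheme_count("\r\na"));  // int(2)
var_dump(ascii_grapheme_count("\r\né"));  // NULL
```

Returning the byte length unchanged, without the substr_count() correction, reproduces the int(2) reported for "\r\n".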

[2016-08-19 17:06 UTC] cmb@php.net

-Summary: grapheme_*: ASCII optimisation is not Unicode compliant on CR LF sequence
+Summary: grapheme_*() is not Unicode compliant on CR LF sequence

[2016-08-20 01:22 UTC] cmb@php.net

Automatic comment on behalf of cmbecker69@gmx.de
Revision: http://git.php.net/?p=php-src.git;a=commit;h=e4a006cd3e17338677ec269a8cdb1354f38e0cad
Log: Fix #65732: grapheme_*() is not Unicode compliant on CR LF sequence

[2016-08-20 01:22 UTC] cmb@php.net

-Status: Analyzed
+Status: Closed

[2016-10-17 10:09 UTC] bwoebi@php.net

Automatic comment on behalf of cmbecker69@gmx.de
Revision: http://git.php.net/?p=php-src.git;a=commit;h=e4a006cd3e17338677ec269a8cdb1354f38e0cad
Log: Fix #65732: grapheme_*() is not Unicode compliant on CR LF sequence
