Bug #79308 Unexpected length of unicode string returned
Submitted: 2020-02-26 12:50 UTC Modified: 2020-02-26 14:53 UTC
From: dpiekarski at dompie dot de Assigned: cmb (profile)
Status: Not a bug Package: *Unicode Issues
PHP Version: 7.4.3 OS: Ubuntu 18.04.4 LTS
Private report: No CVE-ID: None
 [2020-02-26 12:50 UTC] dpiekarski at dompie dot de
Description:
------------
Calling grapheme_strlen() on the string 'नमस्ते' returns 3 instead of the expected 4.

ICU version => 65.1
ICU Data version => 65.1
ICU TZData version => 2019c
ICU Unicode version => 12.1
iconv library version => 2.27

On another system (Debian 10) with PHP 7.3 and
ICU version => 63.1
ICU Data version => 63.1
ICU TZData version => 2018e
ICU Unicode version => 11.0
iconv library version => 2.28

it works as expected.


Test script:
---------------
<?php

$word = 'नमस्ते';
var_dump(grapheme_strlen($word));

Expected result:
----------------
int(4)

Actual result:
--------------
int(3)

 [2020-02-26 14:53 UTC] cmb@php.net
-Status: Open +Status: Not a bug -Assigned To: +Assigned To: cmb
 [2020-02-26 14:53 UTC] cmb@php.net
This is not related to the PHP version; e.g. PHP 7.3 with ICU
66.0.1 prints int(3) as well, so it's obviously an upstream issue.

नमस्ते
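
To compare systems, it may help to print the ICU version next to the result. The following is only a minimal sketch, assuming the intl extension is loaded; describe_grapheme_length() is a hypothetical helper name, not part of PHP.

```php
<?php

// Hypothetical helper: reports the PHP and ICU versions next to the
// grapheme count, so output from different systems can be compared.
function describe_grapheme_length(string $word): string
{
    // INTL_ICU_VERSION is defined by the intl extension; fall back if it is missing.
    $icu = defined('INTL_ICU_VERSION') ? INTL_ICU_VERSION : 'unknown';
    $len = function_exists('grapheme_strlen') ? grapheme_strlen($word) : null;

    return sprintf(
        'PHP %s, ICU %s: grapheme_strlen = %s',
        PHP_VERSION,
        $icu,
        var_export($len, true)
    );
}

echo describe_grapheme_length('नमस्ते'), PHP_EOL;
```

According to the reports in this thread, ICU 63.1 yields 4 for this word, while ICU 65.1 and 66.0.1 yield 3.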
 [2020-03-25 22:38 UTC] srl295 at gmail dot com
This is a User Perceived Character, aka Extended Grapheme Cluster, see http://www.unicode.org/reports/tr29/

`स्ते` is one cluster. If you move through the string नमस्ते with the arrow keys in a Unicode-aware text editor (especially a GUI one), you will see that the cursor and selection do not break up the "ste" that follows the "m".

See for example http://www.unicode.org/reports/tr29/#Table_Sample_Grapheme_Clusters where "षि" is one grapheme cluster.
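
To see where the installed ICU places the boundaries, the word can be split into bytes, code points, and grapheme clusters. This is a rough sketch, assuming the intl extension is available for grapheme_extract(); the variable names are illustrative only, and the cluster count depends on the ICU version.

```php
<?php

$word = 'नमस्ते';

// Code points: split on the empty pattern in UTF-8 mode (PCRE only, no intl needed).
$codePoints = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);

printf("bytes: %d\n", strlen($word));
printf("code points: %d\n", count($codePoints));

// Grapheme clusters as ICU sees them, extracted one at a time.
if (function_exists('grapheme_extract')) {
    $clusters = [];
    $offset = 0;
    $next = 0;
    while ($offset < strlen($word)) {
        // $next receives the byte offset at which the following cluster starts.
        $cluster = grapheme_extract($word, 1, GRAPHEME_EXTR_COUNT, $offset, $next);
        if ($cluster === false || $cluster === '') {
            break;
        }
        $clusters[] = $cluster;
        $offset = $next;
    }
    printf("grapheme clusters: %d (%s)\n", count($clusters), implode(' | ', $clusters));
}
```

The word consists of 6 code points (18 bytes in UTF-8); only the grouping into clusters differs between ICU versions.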


So I would say this is not a bug.
 [2020-03-26 10:17 UTC] dpiekarski at dompie dot de
So you are basically saying that the last two grapheme clusters shown on this page are really one?
https://symfony.com/doc/current/components/string.html#what-is-a-string

Especially this image:
https://symfony.com/doc/current/_images/bytes-points-graphemes.png

I'm still not convinced. On my system (Ubuntu 18.04), every editor I tried, including PhpStorm, shows 4 grapheme clusters, as in the image linked above.

Are you saying this is all wrong? I know neither the language nor the word, but technically it looks to me as if the length should be 4. The last two letters of the word 'नमस्ते' are each already grapheme clusters.
 