Bug #79308 Unexpected length of unicode string returned
Submitted: 2020-02-26 12:50 UTC Modified: 2020-02-26 14:53 UTC
From: dpiekarski at dompie dot de Assigned: cmb (profile)
Status: Not a bug Package: *Unicode Issues
PHP Version: 7.4.3 OS: Ubuntu 18.04.4 LTS
Private report: No CVE-ID: None
 [2020-02-26 12:50 UTC] dpiekarski at dompie dot de
Description:
------------
Calling grapheme_strlen() on the string 'नमस्ते' returns 3 instead of the expected 4.

ICU version => 65.1
ICU Data version => 65.1
ICU TZData version => 2019c
ICU Unicode version => 12.1
iconv library version => 2.27

On another system (Debian 10) with PHP 7.3 and
ICU version => 63.1
ICU Data version => 63.1
ICU TZData version => 2018e
ICU Unicode version => 11.0
iconv library version => 2.28

it works as expected.


Test script:
---------------
<?php

$word = 'नमस्ते';
var_dump(grapheme_strlen($word));

Expected result:
----------------
int(4)

Actual result:
--------------
int(3)

 [2020-02-26 14:53 UTC] cmb@php.net
-Status: Open +Status: Not a bug -Assigned To: +Assigned To: cmb
 [2020-02-26 14:53 UTC] cmb@php.net
This is not related to the PHP version; e.g. PHP 7.3 with ICU
66.0.1 prints int(3) as well, so it's obviously an upstream issue.

नमस्ते
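
Since the result appears to follow the ICU version rather than the PHP version, it helps to print both next to the cluster count when comparing machines. A minimal sketch, assuming the intl extension is loaded; the helper name icu_environment() is made up for illustration, while INTL_ICU_VERSION and INTL_ICU_DATA_VERSION are constants provided by ext/intl:

```php
<?php
// Collect the versions that matter when comparing results across machines.
// icu_environment() is a hypothetical helper, not part of PHP.
function icu_environment(): array
{
    return [
        'php'      => PHP_VERSION,
        'icu'      => defined('INTL_ICU_VERSION') ? INTL_ICU_VERSION : 'intl not loaded',
        'icu_data' => defined('INTL_ICU_DATA_VERSION') ? INTL_ICU_DATA_VERSION : 'intl not loaded',
    ];
}

foreach (icu_environment() as $name => $version) {
    printf("%-8s => %s\n", $name, $version);
}

// The count reported here depends on the ICU rules in use, not on PHP itself.
if (function_exists('grapheme_strlen')) {
    var_dump(grapheme_strlen('नमस्ते'));
}
```

Running this on both systems should show whether the differing counts line up with differing ICU versions.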
 [2020-03-25 22:38 UTC] srl295 at gmail dot com
This is a User Perceived Character, aka Extended Grapheme Cluster, see http://www.unicode.org/reports/tr29/

`स्ते` is one cluster. If you move through the string नमस्ते with the arrow keys in a Unicode-aware text editor (especially a GUI one), you will see that the cursor and selection jump from the end of the "m" (म) straight to the end of the "ste" (स्ते) without stopping in between.

See for example http://www.unicode.org/reports/tr29/#Table_Sample_Grapheme_Clusters where "षि" is one grapheme cluster.


So I would say this is not a bug.
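
To see where the clusters fall on a given system, the string can be split into code points and into grapheme clusters side by side. The following is only a sketch, assuming the intl extension is available; code_points() and grapheme_clusters() are hypothetical helper names. The fourth code point of the word is U+094D (DEVANAGARI SIGN VIRAMA), which is the character that joins स and त into a conjunct; whether the segmenter keeps that conjunct in one cluster depends on the ICU version.

```php
<?php
$word = 'नमस्ते';

// Split a UTF-8 string into code points (the /u flag makes PCRE UTF-8 aware).
function code_points(string $s): array
{
    return preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
}

// Split a UTF-8 string into grapheme clusters, one at a time, using ICU
// through grapheme_extract(). Offsets are byte offsets into the string.
function grapheme_clusters(string $s): array
{
    $clusters = [];
    $offset = 0;
    $length = strlen($s);
    while ($offset < $length) {
        $next = 0;
        $cluster = grapheme_extract($s, 1, GRAPHEME_EXTR_COUNT, $offset, $next);
        if ($cluster === false || $next <= $offset) {
            break; // defensive: stop rather than loop forever
        }
        $clusters[] = $cluster;
        $offset = $next;
    }
    return $clusters;
}

// Print each cluster followed by the code points it contains.
foreach (grapheme_clusters($word) as $i => $cluster) {
    $names = [];
    foreach (code_points($cluster) as $cp) {
        $ord = IntlChar::ord($cp);
        $names[] = sprintf('U+%04X %s', $ord, IntlChar::charName($ord));
    }
    printf("cluster %d: %s  [%s]\n", $i + 1, $cluster, implode(', ', $names));
}
```

The number of lines printed is the value grapheme_strlen() reports on the same machine, so the output shows directly which code points were grouped together.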
 [2020-03-26 10:17 UTC] dpiekarski at dompie dot de
So you are saying that the last two grapheme clusters shown on this page are actually one?
https://symfony.com/doc/current/components/string.html#what-is-a-string

Especially this image:
https://symfony.com/doc/current/_images/bytes-points-graphemes.png

I'm still not convinced. On my system (Ubuntu 18), every editor I tried, including PhpStorm, shows 4 grapheme clusters, as in the image linked above.

Are you saying that all of this is wrong? I know neither the language nor the word, but technically it looks to me as if the length should be 4. The last two letters of the word 'नमस्ते' are already grapheme clusters on their own.
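
One way to compare what an editor shows with what ICU computes is to list the byte offsets at which ICU's character break iterator reports a boundary. This is a sketch under the assumption that the intl extension is loaded; cluster_boundaries() is a made-up helper name, and an editor is not guaranteed to use the same ICU version or the same rules.

```php
<?php
// Return the byte offsets of all grapheme cluster boundaries in $s,
// as reported by ICU's character break iterator (default locale).
// cluster_boundaries() is a hypothetical helper, not part of PHP.
function cluster_boundaries(string $s): array
{
    $it = IntlBreakIterator::createCharacterInstance();
    $it->setText($s);

    $boundaries = [];
    for ($pos = $it->first(); $pos !== IntlBreakIterator::DONE; $pos = $it->next()) {
        $boundaries[] = $pos;
    }
    return $boundaries;
}

$word = 'नमस्ते';
$boundaries = cluster_boundaries($word);

// Each pair of neighbouring boundaries delimits one cluster.
for ($i = 0; $i + 1 < count($boundaries); $i++) {
    $start = $boundaries[$i];
    $end = $boundaries[$i + 1];
    printf("bytes %2d-%2d: %s\n", $start, $end, substr($word, $start, $end - $start));
}
```

A system that reports four clusters should list one more boundary than a system that reports three, which makes the difference between two machines easy to spot.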
 
Last updated: Tue Mar 19 05:01:29 2024 UTC