Bug #67276 mb_strwidth count combining chars
Submitted: 2014-05-14 13:41 UTC
Modified: 2021-08-16 17:03 UTC
From: nicolas dot grekas+php at gmail dot com
Assigned:
Status: Open
Package: mbstring related
PHP Version: 5.5.12
OS:
Private report: No
CVE-ID: None
 [2014-05-14 13:41 UTC] nicolas dot grekas+php at gmail dot com
Description:
------------
Combining characters should count as zero width when measured with mb_strwidth().

Test script:
---------------
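<?php

// Precomposed "é": a single code point.
$a = 'é';

echo mb_strwidth($a, 'utf8'), "\n";

// Decomposed form (NFD): "e" followed by a combining acute accent. Requires ext/intl.
$b = Normalizer::normalize($a, Normalizer::NFD);

echo mb_strwidth($b, 'utf8'), "\n";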


Expected result:
----------------
1
1

Actual result:
--------------
1
2

 [2018-03-11 14:41 UTC] cmb@php.net
Not sure if this is a bug or rather a feature request.  Anyhow, I
guess it won't be fixed, since mbfl has no notion of combining
characters, generally, and the grapheme_*() functions already
cater to that.  Furthermore, mb_strwidth() appears to be useful
for monospaced fonts only.
 [2021-08-16 15:14 UTC] cmb@php.net
-Status: Open
+Status: Not a bug
-Assigned To:
+Assigned To: cmb
 [2021-08-16 15:14 UTC] cmb@php.net
> Furthermore, mb_strwidth() appears to be useful for monospaced
> fonts only.

Not really, but still, there is nothing to fix here per my
previous comment.
 [2021-08-16 15:23 UTC] nicolasgrekas@php.net
But then, the function is just broken for the purpose it should serve.

For reference, there is a standard on the topic, which is implemented in C at:
https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c

and in Python at https://github.com/jquast/wcwidth.

In PHP, there is this:
https://github.com/symfony/string/blob/bd53358e3eccec6a670b5f33ab680d8dbe1d4ae1/AbstractUnicodeString.php#L508
 [2021-08-16 15:34 UTC] cmb@php.net
All of MBString is broken with regard to grapheme clusters; these
are not catered to by the other MBString functions either[1].  I
don't think this will ever change.

It might make sense to introduce grapheme_strwidth(), but
apparently ICU does not support that yet[2].  I'd rather avoid our
own implementation (or relying on wcwidth()), since these may
easily become outdated regarding new Unicode features.

Not sure what to do here.  Maybe you want to write to internals?

[1] <https://3v4l.org/taWVn>
[2] <https://unicode-org.atlassian.net/browse/ICU-12726>
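To illustrate the code point versus grapheme cluster distinction discussed above, here is a minimal sketch. It assumes PHP 7 or later (for the `\u{...}` escape) and uses PCRE's `\X` as a stand-in for grapheme segmentation; the `grapheme_strlen()` call is guarded because ext/intl may not be loaded. The variable names are illustrative only.

```php
<?php

// "e" followed by U+0301 COMBINING ACUTE ACCENT; renders as a single "é".
$nfd = "e\u{0301}";

// mbstring counts code points, so the combining mark is counted separately.
$codePoints = mb_strlen($nfd, 'UTF-8');

// PCRE's \X matches one extended grapheme cluster per match.
$clusters = preg_match_all('/\X/u', $nfd);

echo $codePoints, "\n"; // 2
echo $clusters, "\n";   // 1

// ext/intl, when available, should agree with the cluster count.
if (function_exists('grapheme_strlen')) {
    echo grapheme_strlen($nfd), "\n";
}
```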
 [2021-08-16 16:13 UTC] nikic@php.net
At least going by the referenced C implementation, this looks a lot simpler than grapheme clusters, so it's something we might include.

You mentioned that there is a standard for this, could you point me to it? I know about UAX#11, but I don't think that specifies an actual character width algorithm.
 [2021-08-16 16:19 UTC] cmb@php.net
-Status: Not a bug
+Status: Open
-Assigned To: cmb
+Assigned To:
 [2021-08-16 16:19 UTC] cmb@php.net
> You mentioned that there is a standard for this, could you point
> me to it?

Likely
<https://pubs.opengroup.org/onlinepubs/9699919799/functions/wcwidth.html>
and
<https://pubs.opengroup.org/onlinepubs/9699919799/functions/wcswidth.html>.

Note, though, that <https://unicode-org.atlassian.net/browse/ICU-12726>
mentions that these implementations were out of date/limited.  Not sure
about the linked implementation.
 [2021-08-16 16:46 UTC] nikic@php.net
> Likely <https://pubs.opengroup.org/onlinepubs/9699919799/functions/wcwidth.html> and <https://pubs.opengroup.org/onlinepubs/9699919799/functions/wcswidth.html>.

Unless I'm missing something, it doesn't look like these actually specify how the functions are supposed to work...
 [2021-08-16 17:03 UTC] cmb@php.net
> Unless I'm missing something, it doesn't look like these
> actually specify how the functions are supposed to work...

Well, wcwidth() accepts a wchar_t, and to my knowledge, that would
be a single code point for Unicode encodings, so no grapheme
cluster support.

Nicolas may have referred to some other standard.
 [2021-08-16 17:09 UTC] nicolasgrekas@php.net
You've found the standard I had in mind.
I think the code doesn't need to care about grapheme clusters, because counting a width of zero for combining chars does the job.
A table of combining chars is all that is needed there.
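The zero-width idea can be sketched in userland as follows. This is only an approximation under stated assumptions: `strwidth_ignoring_marks()` is a hypothetical helper, not an mbstring function, and it uses the Unicode properties `\p{Mn}` and `\p{Me}` in place of the explicit table of combining characters that wcwidth.c carries. It does not cover other zero-width cases such as format characters.

```php
<?php

// Hypothetical helper, not part of mbstring: display width with
// combining marks counted as zero columns.
function strwidth_ignoring_marks(string $s): int
{
    // \p{Mn} = nonspacing marks, \p{Me} = enclosing marks.
    // preg_replace() returns null on invalid UTF-8; fall back to the input.
    $stripped = preg_replace('/[\p{Mn}\p{Me}]/u', '', $s) ?? $s;

    return mb_strwidth($stripped, 'UTF-8');
}

// "e" followed by U+0301 COMBINING ACUTE ACCENT (the NFD form of "é").
$nfd = "e\u{0301}";

echo strwidth_ignoring_marks($nfd), "\n"; // 1
```

Stripping the marks before calling mb_strwidth() keeps the East Asian width handling of mbstring intact while removing the extra column reported in this bug.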
 