Bug #28220 mb_strwidth() returns wrong width values for some Hangul characters.
Submitted: 2004-04-29 18:48 UTC Modified: 2004-10-08 16:48 UTC
From: martin dot t dot kutschker at blackbox dot net Assigned:
Status: Closed Package: mbstring related
PHP Version: Irrelevant OS:
Private report: No CVE-ID: None
 [2004-04-29 18:48 UTC] martin dot t dot kutschker at blackbox dot net
Description:
------------
The table describing the width of the characters is wrong if you compare it with the table for Unicode 4.0:

http://www.unicode.org/Public/UNIDATA/EastAsianWidth.txt

For the BMP the wide/full-width chars are:

1100..115F  Hangul Choseong
2E80..4DB5  CJK radicals and CJK Ideograph Extension A
4E00..D7A3  CJK Ideographs, Yi syll. and Hangul syll.
F900..FA6A  CJK compatibility ideographs
FE30..FE6B  presentation forms, punctuations, etc.
FF01..FF60  full-width Latin letters
FFE0        FULLWIDTH CENT SIGN
FFE1        FULLWIDTH POUND SIGN
FFE2        FULLWIDTH NOT SIGN
FFE3        FULLWIDTH MACRON
FFE4        FULLWIDTH BROKEN BAR
FFE5        FULLWIDTH YEN SIGN
FFE6        FULLWIDTH WON SIGN

I didn't check what the actual implementation does, but the docs are certainly wrong (if they mean Unicode code points).


 [2004-05-01 15:30 UTC] moriyoshi@php.net
This is a valid bug.
# thanks Nuno.

 [2004-05-04 11:53 UTC] martin dot t dot kutschker at blackbox dot net
I rechecked EastAsianWidth and found two more wide chars. I also noticed that the range 2E80..4DB5 is in fact split by a single half-width filler space char:

1100..115F  Hangul Choseong
2329        LEFT-POINTING ANGLE BRACKET
232A        RIGHT-POINTING ANGLE BRACKET
2E80..303E  CJK and Kangxi radicals, ideographic chars
3041..4DB5  Hiragana, Katakana, Bopomofo and Hangul letters
4E00..D7A3  CJK ideographs, Yi and Hangul syllables
F900..FA6A  CJK compatibility ideographs
FE30..FE6B  presentation forms, punctuations, etc.
FF01..FF60  full-width Latin letters
FFE0        FULLWIDTH CENT SIGN
FFE1        FULLWIDTH POUND SIGN
FFE2        FULLWIDTH NOT SIGN
FFE3        FULLWIDTH MACRON
FFE4        FULLWIDTH BROKEN BAR
FFE5        FULLWIDTH YEN SIGN
FFE6        FULLWIDTH WON SIGN
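
The table above can be turned into a lookup directly. What follows is a minimal Python sketch, not the mbstring implementation: the names `WIDE_RANGES`, `char_width` and `str_width` are hypothetical, and the ranges are copied from this comment rather than from a current EastAsianWidth.txt. It binary-searches the sorted range starts and counts a code point as width 2 if it falls inside a listed range, width 1 otherwise.

```python
import bisect

# Wide/full-width BMP ranges (inclusive), copied from the list in this comment.
# FFE0..FFE6 is the seven FULLWIDTH signs collapsed into one range.
WIDE_RANGES = [
    (0x1100, 0x115F),
    (0x2329, 0x232A),
    (0x2E80, 0x303E),
    (0x3041, 0x4DB5),
    (0x4E00, 0xD7A3),
    (0xF900, 0xFA6A),
    (0xFE30, 0xFE6B),
    (0xFF01, 0xFF60),
    (0xFFE0, 0xFFE6),
]
_STARTS = [lo for lo, _ in WIDE_RANGES]


def char_width(ch):
    """Return 2 if ch lies in one of the listed wide ranges, else 1."""
    cp = ord(ch)
    # Index of the last range whose start is <= cp (or -1 if none).
    i = bisect.bisect_right(_STARTS, cp) - 1
    if i >= 0 and cp <= WIDE_RANGES[i][1]:
        return 2
    return 1


def str_width(s):
    """Sum the per-character widths of s."""
    return sum(char_width(ch) for ch in s)
```

With this table, a Hangul syllable such as U+D55C counts as 2, while U+303F (the half-width filler space that splits the old 2E80..4DB5 range) counts as 1.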

Please also note that Unicode knows about "ambiguous" (A) chars. See these quotes from http://www.unicode.org/reports/tr11/:

"In a broad sense, wide characters include W, F, and A (when in EA context), 
 while narrow characters include N, Na, H, and A (when not in EA context)."

"Ambiguous characters behave like wide or narrow characters depending on 
 context (language tag, script identification, associated font, source of 
 data, or explicit markup; all can provide the context). If the context 
 cannot be established reliably they should be treated as narrow characters 
 by default."

So mb_strwidth could try to auto-detect the context (e.g. by locale) or take an optional East Asian context argument.
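
The optional-argument idea could look roughly like the sketch below. It is written in Python and relies on the standard library's `unicodedata.east_asian_width()`, which reports the TR11 class (`W`, `F`, `A`, `N`, `Na`, `H`) for the Unicode version bundled with the interpreter, not Unicode 4.0. The function name `str_width_ctx` and the `east_asian` flag are assumptions made for illustration; they are not part of mb_strwidth.

```python
import unicodedata


def str_width_ctx(s, east_asian=False):
    """Display width of s following the TR11 rule quoted above.

    W and F are always wide. A (ambiguous) is wide only when the caller
    declares an East Asian context; otherwise it defaults to narrow.
    Simplification: every other character, including combining marks,
    is counted as width 1.
    """
    width = 0
    for ch in s:
        cls = unicodedata.east_asian_width(ch)
        if cls in ("W", "F") or (cls == "A" and east_asian):
            width += 2
        else:
            width += 1
    return width
```

For example, U+00B1 PLUS-MINUS SIGN is classed as ambiguous, so `str_width_ctx("\u00b1")` gives 1 while `str_width_ctx("\u00b1", east_asian=True)` gives 2. Defaulting the flag to narrow matches the TR11 advice for cases where the context cannot be established.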
 [2004-06-29 14:25 UTC] moriyoshi@php.net
Try this patch and see if it works.

http://www.voltex.jp/patches/bug28220-preliminary.patch.diff

This patch is only applicable for PHP 4.3.2 or later.


~/src/php-4.3.7 $ patch -p0 -R < bug28220-preliminary.patch.diff

 [2004-07-07 01:00 UTC] php-bugs at lists dot php dot net
No feedback was provided for this bug for over a week, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
 [2004-07-10 20:41 UTC] martin dot t dot kutschker at blackbox dot net
I never tried the original code (I only noticed the problem from reading the docs), so I did not test the diff. Anyway, I'm offline for two weeks, so I won't be able to give the fix a try for some time.
 [2004-07-12 08:52 UTC] derick@php.net
We're still waiting for feedback, so leave it in that state.
 [2004-07-20 01:00 UTC] php-bugs at lists dot php dot net
No feedback was provided for this bug for over a week, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
 [2004-10-08 16:48 UTC] moriyoshi@php.net
This bug has been fixed in CVS.

Snapshots of the sources are packaged every three hours; this change
will be in the next snapshot. You can grab the snapshot at
http://snaps.php.net/.
 
Thank you for the report, and for helping us make PHP better.