Doc Bug #63079 String access by character is not multibyte-safe
Submitted: 2012-09-13 09:58 UTC Modified: 2012-11-14 02:10 UTC
From: astatutov at gmail dot com Assigned: aharvey (profile)
Status: Closed Package: Documentation problem
PHP Version: Irrelevant OS:
Private report: No CVE-ID: None
 [2012-09-13 09:58 UTC] astatutov at gmail dot com
Description:
------------
I know there is a section named "Details of the String Type" in the documentation. But there is another section that states "Think of a string as an array of characters for this purpose", and it is very convenient to think of it that way. We use the mbstring extension to work entirely in utf-8, and the mbstring.func_overload option lets us almost forget about the differences between regular and multibyte strings. We just write our application thinking about its own logic, not PHP's internal logic. This is a high-level programming language, after all. We use strlen, substr, etc. just as we would with regular strings. And BANG! The string bracket operator returns bytes, not characters!

I think it's unpredictable behavior, even if it's well-documented (but it's not). Considering that the use of utf-8 is growing everywhere and maybe even PHP 6 will support it by default, why not implement multibyte support for bracket operations in the mbstring extension now? Of course, it would have to be configurable to stay backward compatible. I know we can use substr as a replacement for the string access operation, but it's very slow and it's wrong in general.
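
Editorial sketch, not part of the original report: when many characters have to be read by index, one common workaround is to split the string into an array of characters once and index that array, rather than calling substr for every position. The example assumes the source file is saved as UTF-8 and the input is valid UTF-8; `$chars` is a name chosen here for illustration.

```php
<?php
// Split a UTF-8 string into an array of characters in a single pass.
// The /u modifier makes PCRE treat the subject as UTF-8; on invalid
// UTF-8 input preg_split() returns false instead of an array.
$str = "Kąt";
$chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);

echo $chars[1], "\n";     // ą
echo count($chars), "\n"; // 3
```

This trades memory for speed: the array costs more than the string itself, but each later lookup is a plain array access.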

I also know this is not the first bug on this subject. There was #51919, for example, which was closed and marked as not a bug. But I propose looking at this problem from the point of view of the language's logic, not the implementation.

Sorry if I've missed something else.

Test script:
---------------
<?php
$str = "Kąt";  // UTF-8 source: "ą" is stored as two bytes
echo $str[1];  // outputs only the first of those two bytes

Expected result:
----------------
ą

Actual result:
--------------
�

 [2012-09-13 10:48 UTC] laruence@php.net
Yeah, it's not multibyte-safe.
You should use mb_* to deal with multi-byte characters.
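
Editorial sketch, not part of the original comment: a minimal illustration of the mb_* approach for the test string from the report. It assumes the mbstring extension is loaded and the source file is saved as UTF-8; the encoding is passed explicitly so the result does not depend on configuration.

```php
<?php
// Character-based access with mbstring instead of the bracket operator.
$str = "Kąt";

echo mb_substr($str, 1, 1, 'UTF-8'), "\n"; // ą  (one character starting at character offset 1)
echo mb_strlen($str, 'UTF-8'), "\n";       // 3  (characters)
echo strlen($str), "\n";                   // 4  (bytes)
```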
 [2012-09-13 11:57 UTC] astatutov at gmail dot com
> you should use mb_* to deal with multi-byte characters

I know. I mentioned it in the description. The mbstring.func_overload option does that for me. But the bracket operator is still unusable: the documentation states that it accesses a character, while it doesn't. And I believe this is not a documentation problem. Every modern language I know that can work with utf-8 does so transparently for the developer. The aim of mbstring is the same, isn't it? Having set mbstring.internal_encoding to utf-8, a developer will expect the INTERNAL string access operator to support it. This is what the term "predictable behavior" means.
 [2012-09-13 14:12 UTC] laruence@php.net
As the option name itself says, it is *mbstring*.internal_encoding, not php.internal_encoding...
 [2012-09-13 19:27 UTC] astatutov at gmail dot com
*mbstring* just determines which module reads this option. It doesn't say which module the option affects. For instance, the mbstring.func_overload option affects the whole of PHP, because it overrides native functions. The mbstring.http_input option changes the default PHP behavior when reading an HTTP request, and so on. So why can't mbstring.func_overload or, say, mbstring.op_overload override the string access operation?
 [2012-11-13 10:49 UTC] Matti dot jarvinen at nitroid dot fi
Under "String access and modification by character" at http://php.net/manual/en/language.types.string.php there is no mention of the [] syntax not being multibyte-safe.

At least make this a documentation issue.
 [2012-11-14 02:03 UTC] aharvey@php.net
-Status: Open
+Status: Assigned
-Type: Bug
+Type: Documentation Problem
-Package: Strings related
+Package: Documentation problem
-Assigned To:
+Assigned To: aharvey
 [2012-11-14 02:03 UTC] aharvey@php.net
It does say right at the top of the string type page that a character is a byte, but I'll add a warning to the character access section to be extra clear.
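
Editorial sketch, not part of the original comment: a byte-level view of the test string that shows what "a character is a byte" means in practice. It assumes the source file is saved as UTF-8, where "ą" (U+0105) is encoded as the two bytes 0xC4 0x85.

```php
<?php
// The bracket operator indexes bytes, so offset 1 is only half of "ą".
$str = "Kąt";

echo strlen($str), "\n";       // 4
echo bin2hex($str), "\n";      // 4bc48574
echo bin2hex($str[1]), "\n";   // c4
echo $str[1] . $str[2], "\n";  // ą  (both bytes together form the character)
```

Printing `$str[1]` on its own emits the lone byte 0xC4, which is not valid UTF-8 by itself; that is why the report's actual result shows a replacement character.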
 [2012-11-14 02:10 UTC] aharvey@php.net
Automatic comment from SVN on behalf of aharvey
Revision: http://svn.php.net/viewvc/?view=revision&revision=328351
Log: Warn users about the perils of multi-byte strings and direct character access.

Or, as xkcd put it today, if it starts pointing toward space you are having a
bad problem and you will not go to space today.

Fixes doc bug #63079 (String access by character is not multibyte-safe).
 [2012-11-14 02:10 UTC] aharvey@php.net
-Status: Assigned
+Status: Closed
 [2012-11-14 02:10 UTC] aharvey@php.net
This bug has been fixed in the documentation's XML sources. Since the
online and downloadable versions of the documentation need some time
to get updated, we would like to ask you to be a bit patient.

Thank you for the report, and for helping us make our documentation better.


 