Doc Bug #73123 Substr documentation is wrong and misleading
Submitted: 2016-09-20 10:49 UTC Modified: 2020-10-01 12:07 UTC
Votes:3
Avg. Score:3.7 ± 0.9
Reproduced:1 of 1 (100.0%)
Same Version:1 (100.0%)
Same OS:1 (100.0%)
From: gregoire dot daussin at gmail dot com Assigned: girgias (profile)
Status: Assigned Package: Strings related
PHP Version: Irrelevant OS:
Private report: No CVE-ID: None
 [2016-09-20 10:49 UTC] gregoire dot daussin at gmail dot com
Description:
------------
---
From manual page: http://www.php.net/function.substr
---
The substr() documentation talks about characters everywhere, while the function cuts strings byte-wise rather than character-wise; this matters especially for UTF-8.
The documentation should talk about bytes and include a note about mb_substr().
Please fix this; it is confusing, and I have been misled by it for years :(.
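
To make the byte-versus-character distinction concrete, here is a minimal sketch. It assumes PHP 7 or later (for the "\u{...}" escape) and that the mbstring extension is loaded; the variable names are only illustrative.

```php
<?php
// Sketch only: assumes PHP 7+ and the mbstring extension.
$s = "\u{00E4}bc";                       // "äbc"; U+00E4 is two bytes in UTF-8 (0xC3 0xA4)

$byteLen = strlen($s);                   // counts bytes
$charLen = mb_strlen($s, 'UTF-8');       // counts code points

$byteCut = substr($s, 0, 1);             // first byte only: half of 'ä'
$charCut = mb_substr($s, 0, 1, 'UTF-8'); // first code point: the whole 'ä'

// Hex dumps make the difference visible even on a non-UTF-8 terminal.
var_dump($byteLen, $charLen, bin2hex($byteCut), bin2hex($charCut));
```

Here substr() returns a lone 0xC3 byte, which is not valid UTF-8 on its own, while mb_substr() returns both bytes of 'ä'.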



History

 [2017-01-28 14:06 UTC] cmb@php.net
-Package: Documentation problem +Package: Strings related
 [2018-12-30 03:44 UTC] girgias@php.net
-Status: Open +Status: Assigned -PHP Version: 5.6.26 +PHP Version: Irrelevant -Assigned To: +Assigned To: girgias
 [2018-12-30 03:44 UTC] girgias@php.net
I will have a look at it; however, this applies to all non-mb_ functions.
 [2020-10-01 07:13 UTC] balazs dot kovacs at gmail dot com
Please fix this. I just ran into this issue after a very arduous debugging process, which could have been cut short had the documentation made any mention of mb_substr! Not to mention that my substr call returned null and threw nothing, and this behaviour is not documented. The string I was running the function on contained simple 'ä' (U+00E4) characters, UTF-8 encoded. Also, this has not previously caused any issues when run through substr.
 [2020-10-01 12:07 UTC] girgias@php.net
Can you provide an explicit example of when substr() returns null? As far as I understand, that shouldn't happen.

mb_substr() is already in the See Also section so I'm not sure what more you want for pointing in this direction, as there are various other options too, namely iconv_substr() and grapheme_substr().

The fact that a character is the same as a byte is documented on the string type page: https://www.php.net/language.types.string.

Moreover, Unicode allows *multiple* valid code-point sequences for a single "character" (quoted because a character is a very nebulous concept; see https://utf8everywhere.org/#characters).

So what is likely is that instead of having ä encoded as the single code point U+00E4 (two bytes in UTF-8), it is encoded as the code point for 'a' followed by the combining diaeresis code point '¨' (U+0308), thus taking three bytes.
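
The two representations can be compared byte by byte. This is a sketch under the assumption of PHP 7 or later and a loaded mbstring extension; the variable names are made up for illustration.

```php
<?php
// Sketch only: assumes PHP 7+ and the mbstring extension.
$precomposed = "\u{00E4}";  // 'ä' as the single code point U+00E4
$decomposed  = "a\u{0308}"; // 'a' followed by U+0308 COMBINING DIAERESIS

$hexPre = bin2hex($precomposed); // two bytes in UTF-8
$hexDec = bin2hex($decomposed);  // three bytes in UTF-8

// The strings render alike but are different byte sequences.
$sameBytes = ($precomposed === $decomposed);

// Even a code-point-aware cut separates the base letter from its diacritic.
$firstCodePoint = mb_substr($decomposed, 0, 1, 'UTF-8');

var_dump($hexPre, $hexDec, $sameBytes, $firstCodePoint);
```

With the decomposed form, mb_substr() returns a bare 'a', which is where a grapheme-aware function such as grapheme_substr() (intl extension) becomes relevant.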

The main reason this hasn't been fixed is that the proposed fix implies changing not just the documentation of substr() but every part of the manual that says "characters" when it talks about 1-byte encodings, which seems counterproductive, as this detail is mentioned on the string type page.

Another solution would be to have a note on every page about strings being byte arrays, but that's again counterproductive.

A different note could be a warning about encodings, but that's a whole different topic which is, at least in my eyes, kinda irrelevant to the topic at hand and is very complicated.

As for mentioning the other functions, the mb_ variant at least is in the See Also section, as mentioned before.

One possibly reasonable solution is to include an example with a multi-byte character to highlight this.
 [2020-10-01 13:01 UTC] balazs dot kovacs at talokuntoon dot fi
I cannot provide an example, as it is the last name of one of our users and falls under GDPR.

The name had a format similar to "Bbbbbbb Bbbäbä Bbbbbä", and substr was called on the string variable within an array declaration, like $myArr = ["myKey" => substr($customerName,0,28)];

This resulted in the array becoming null, as well as the customer name variable that the substr return value was assigned to.

Also, as I mentioned, the 'ä' characters in the name are all single-code-point Unicode U+00E4 characters.

As far as I can tell, there is no reason not to replace substr with mb_substr, as they are functionally identical and mb_substr simply doesn't break when handling modern encodings.
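
The call pattern described above can be sketched with a made-up stand-in for the name, since the real value is not available. The name, the limit, and the variable names are all hypothetical; PHP 7 or later and the mbstring extension are assumed.

```php
<?php
// Sketch only: hypothetical data, assumes PHP 7+ and the mbstring extension.
$customerName = "Bbb\u{00E4}b\u{00E4}"; // stand-in name; each 'ä' is two bytes in UTF-8
$limit = 4;                             // a byte limit that lands inside the first 'ä'

$myArr = ["myKey" => substr($customerName, 0, $limit)];

$isString  = is_string($myArr["myKey"]);                  // substr() still returns a string
$validUtf8 = mb_check_encoding($myArr["myKey"], 'UTF-8'); // but the cut left a dangling byte

// mb_strcut() also takes a byte limit, but backs off to a character boundary.
$safeCut = mb_strcut($customerName, 0, $limit, 'UTF-8');

var_dump($isString, $validUtf8, bin2hex($myArr["myKey"]), $safeCut);
```

In this sketch substr() yields a non-null string that is no longer valid UTF-8. Whether later code that expects valid UTF-8 rejected such a value in the reported case cannot be determined from the thread.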
 [2020-10-01 13:15 UTC] girgias@php.net
I cannot reproduce your issue, as can be seen here: https://3v4l.org/ohLHT

So I seriously doubt that your ä is actually encoded as U+00E4.

All functions in the mb_* family need to detect the encoding if you do not pass it explicitly (and even then some might need to do some conversion to handle the specific encoding), and they are thus slower than their normal counterparts in some cases.
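
As a rough illustration of why the encoding argument matters: the mbstring manual states that when the argument is omitted, the internal encoding reported by mb_internal_encoding() is used. The sketch below assumes PHP 7 or later with mbstring loaded and passes the encoding explicitly both ways.

```php
<?php
// Sketch only: assumes PHP 7+ and the mbstring extension.
$s = "\u{00E4}bc"; // "äbc" stored as UTF-8 bytes

// Interpreted as UTF-8, the first "character" is the two-byte 'ä'.
$asUtf8 = mb_substr($s, 0, 1, 'UTF-8');

// Interpreted as a single-byte encoding, the first "character" is one byte,
// which is the same result plain substr() would give.
$asLatin1 = mb_substr($s, 0, 1, 'ISO-8859-1');

// Encoding used when the fourth argument is omitted.
$default = mb_internal_encoding();

var_dump(bin2hex($asUtf8), bin2hex($asLatin1), $default);
```

Passing the encoding explicitly keeps the result independent of the server's configured default.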
 [2020-10-02 05:09 UTC] balazs dot kovacs at talokuntoon dot fi
Could you please add a note to the substr() page to let people know that substr() does not work with modern encodings and explicitly counts single-byte characters rather than letters, as with the composite letters you mentioned? And could it mention that there are many other versions of substr(), like mb_substr(), () and grapheme_substr()?

This would help out everyone who doesn't know of the archaic quirks of substr and would help them see that they should use functions that are aligned with modern encodings and expectations.
 
PHP Copyright © 2001-2024 The PHP Group
All rights reserved.
Last updated: Thu Nov 21 11:01:29 2024 UTC