Bug #81437 mb_strrchr cutting differently on PHP 8.1
Submitted: 2021-09-14 13:22 UTC Modified: 2021-09-20 15:02 UTC
Votes:1
Avg. Score:5.0 ± 0.0
Reproduced:1 of 1 (100.0%)
Same Version:1 (100.0%)
Same OS:1 (100.0%)
From: nicolasgrekas@php.net Assigned:
Status: Wont fix Package: *General Issues
PHP Version: 8.1.0RC1 OS:
Private report: No CVE-ID: None

 [2021-09-14 13:22 UTC] nicolasgrekas@php.net
Description:
------------
echo mb_strrchr('déjàdéjà', 'é', false, 'ASCII');

echoes "à" on 8.1 but echoes "éjà" on previous versions.

See https://3v4l.org/jujrP


 [2021-09-14 13:34 UTC] cmb@php.net
Shouldn't that return false, since neither haystack nor needle is
ASCII encoded?
 [2021-09-14 13:45 UTC] nicolasgrekas@php.net
It could return false, but that's not the historical behavior apparently :)
 [2021-09-20 14:29 UTC] nikic@php.net
This should explain what is going on here: https://3v4l.org/lH2KZ

In PHP 8.1 the ASCII validation is stricter, and input code units of 0x80 and above are considered illegal. This means that both é and à (two bytes each in UTF-8) become ?? after illegal character substitution.
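
The substitution described above can be traced with plain byte functions. This is only a sketch of the likely mechanism, not the mbstring source: `substitute_non_ascii()` is a hypothetical helper, and the step that maps the match offset back onto the original bytes is an inference from the reported output.

```php
<?php
// Hypothetical helper: mimics "illegal character substitution" for ASCII
// by replacing every byte >= 0x80 with '?'.
function substitute_non_ascii(string $bytes): string
{
    return preg_replace('/[\x80-\xFF]/', '?', $bytes);
}

$haystack = "d\xC3\xA9j\xC3\xA0d\xC3\xA9j\xC3\xA0"; // 'déjàdéjà' as UTF-8 bytes
$needle   = "\xC3\xA9";                             // 'é' as UTF-8 bytes

$h = substitute_non_ascii($haystack); // "d??j??d??j??"
$n = substitute_non_ascii($needle);   // "??"

// The last "??" now sits where the final 'à' was, not where 'é' was.
$offset = strrpos($h, $n);

// Assumption: the offset is applied to the original, unsubstituted bytes.
$result = substr($haystack, $offset);

var_dump($offset, $result === "\xC3\xA0");
```

Under these assumptions the needle no longer distinguishes é from à, so the search lands on the final à, which matches the "à" seen on 8.1.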

If the desired behavior was to do a raw binary search, then the right encoding to use would be 8bit rather than ASCII.
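
A minimal sketch of the 8bit variant, assuming the mbstring extension is loaded. The comparison against `strrpos()` plus `substr()` is added here for illustration and is not from the report.

```php
<?php
$haystack = "d\xC3\xA9j\xC3\xA0d\xC3\xA9j\xC3\xA0"; // 'déjàdéjà' as UTF-8 bytes
$needle   = "\xC3\xA9";                             // 'é' as UTF-8 bytes

// '8bit' treats every byte as one opaque unit, so no byte is ever illegal
// and nothing gets substituted.
$tail = mb_strrchr($haystack, $needle, false, '8bit');

// The same raw binary search with the byte-oriented core functions.
$tailBytes = substr($haystack, strrpos($haystack, $needle));

var_dump($tail === $tailBytes); // both hold the bytes of 'éjà'
```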

I think the only open question here is whether we should make this fail in a different way. Generally mbstring operates on the GIGO principle when it comes to input strings that are incorrectly encoded, because validating them would add significant overhead to all operations. In this case we do already validate the input as a side effect of the conversion to UTF-8, so we could always report no match in that case. But we wouldn't be able to guarantee that behavior either, because a future optimization that skips UTF-8 conversion for single-byte encodings would bring back the historical behavior.

Basically, if mb_check_encoding() for an input to an mbstring function returns false, behavior is undefined and is going to change depending on implementation details.
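
Callers who cannot trust their input can avoid the undefined case by validating first. The following is a sketch only: `mb_strrchr_checked()` is a hypothetical wrapper, not an mbstring function, and it assumes PHP 8.0 or later for the union return type.

```php
<?php
// Hypothetical wrapper: rejects input that is not valid in the declared
// encoding instead of relying on undefined behavior.
function mb_strrchr_checked(string $haystack, string $needle, string $encoding): string|false
{
    foreach ([$haystack, $needle] as $input) {
        if (!mb_check_encoding($input, $encoding)) {
            throw new InvalidArgumentException("Input is not valid {$encoding}");
        }
    }
    return mb_strrchr($haystack, $needle, false, $encoding);
}

$haystack = "d\xC3\xA9j\xC3\xA0d\xC3\xA9j\xC3\xA0"; // 'déjàdéjà' as UTF-8 bytes
$needle   = "\xC3\xA9";                             // 'é' as UTF-8 bytes

try {
    mb_strrchr_checked($haystack, $needle, 'ASCII');
    $rejected = false;
} catch (InvalidArgumentException $e) {
    $rejected = true; // bytes >= 0x80 are not valid ASCII
}

// With the encoding the bytes are actually in, the call goes through.
$tail = mb_strrchr_checked($haystack, $needle, 'UTF-8');
```

The extra `mb_check_encoding()` pass is exactly the overhead that mbstring avoids by default, so a wrapper like this only makes sense at trust boundaries.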
 [2021-09-20 14:37 UTC] nicolasgrekas@php.net
I'm wondering why mb_strrchr() needs to do any validation, but anyway: works for me, let's close if that's fine with you.
 [2021-09-20 15:02 UTC] nikic@php.net
-Status: Open +Status: Wont fix
 [2021-09-20 15:02 UTC] nikic@php.net
It doesn't need to do validation, it just currently happens to be implemented by converting the string to UTF-8 first, because that's necessary for strings encoded in non-self-synchronizing encodings. ASCII isn't one of those and we could skip it there, but right now we don't...
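
To illustrate why a plain byte search is unsafe for a non-self-synchronizing encoding, here is a small sketch using Shift_JIS, chosen as an example and not mentioned in the report. In that encoding the second byte of a two-byte character can have the same value as an ASCII character.

```php
<?php
// The katakana 'ソ' is encoded in Shift_JIS as 0x83 0x5C.
// 0x5C on its own is the ASCII backslash.
$sjis = "\x83\x5C";

// A byte-wise search finds a "backslash" in the middle of the character.
$bytePos = strpos($sjis, "\\");

// An encoding-aware search sees a single character and finds no backslash.
$charPos = mb_strpos($sjis, "\\", 0, 'SJIS');

var_dump($bytePos, $charPos);
```

ASCII has no such multi-byte sequences, which is why the conversion step could be skipped for it.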
 
Last updated: Thu Mar 28 09:01:26 2024 UTC