Bug #81437 mb_strrchr cutting differently on PHP 8.1
Submitted: 2021-09-14 13:22 UTC Modified: 2021-09-20 15:02 UTC
Votes:1
Avg. Score:5.0 ± 0.0
Reproduced:1 of 1 (100.0%)
Same Version:1 (100.0%)
Same OS:1 (100.0%)
From: nicolasgrekas@php.net Assigned:
Status: Wont fix Package: *General Issues
PHP Version: 8.1.0RC1 OS:
Private report: No CVE-ID: None
 [2021-09-14 13:22 UTC] nicolasgrekas@php.net
Description:
------------
echo mb_strrchr('déjàdéjà', 'é', false, 'ASCII');

echoes "à" on 8.1 but echoes "éjà" on previous versions.

See https://3v4l.org/jujrP


 [2021-09-14 13:34 UTC] cmb@php.net
Shouldn't that return false, since neither haystack nor needle is
ASCII encoded?
 [2021-09-14 13:45 UTC] nicolasgrekas@php.net
It could return false, but that's not the historical behavior apparently :)
 [2021-09-20 14:29 UTC] nikic@php.net
This should explain what is going on here: https://3v4l.org/lH2KZ

In PHP 8.1 the ASCII validation is stricter, and input code units of 0x80 and above are treated as illegal. This means that both é and à (two bytes each in UTF-8) become ?? after illegal character substitution.
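
To make the effect concrete, here is a rough simulation of the steps described above, using plain byte functions rather than mbstring itself. It is only a sketch of the reasoning, not the actual mbstring implementation; the `$substitute` closure is a hypothetical stand-in for the illegal character substitution.

```php
<?php
// Simulation only: mimics the substitution described above with byte functions.
$haystack = "d\xC3\xA9j\xC3\xA0d\xC3\xA9j\xC3\xA0"; // 'déjàdéjà' as UTF-8 bytes
$needle   = "\xC3\xA9";                             // 'é' as UTF-8 bytes

// Hypothetical stand-in: replace every byte >= 0x80 with '?',
// as a strict ASCII decoder with '?' substitution would.
$substitute = static fn(string $s): string => preg_replace('/[\x80-\xFF]/', '?', $s);

$h = $substitute($haystack); // "d??j??d??j??"
$n = $substitute($needle);   // "??"

// The last "??" starts at offset 10, which is the final 'à' of the original.
$offset = strrpos($h, $n);
$result = substr($haystack, $offset);

echo $result, PHP_EOL; // prints the bytes C3 A0, i.e. 'à'
```

Under these assumptions the last match of the substituted needle lands on the trailing à instead of the last é, which is consistent with the "à" reported for 8.1.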

If the desired behavior was to do a raw binary search, then the right encoding to use would be 8bit rather than ASCII.
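
A minimal sketch of that suggestion, assuming the input is UTF-8 and a byte-wise match is what the caller wants. The comparison against `strrpos()`/`substr()` is added here only as a cross-check and is not part of the original report.

```php
<?php
$haystack = "d\xC3\xA9j\xC3\xA0d\xC3\xA9j\xC3\xA0"; // 'déjàdéjà' as UTF-8 bytes
$needle   = "\xC3\xA9";                             // 'é' as UTF-8 bytes

// '8bit' treats every byte as one character, so no byte is illegal.
$viaMbstring = mb_strrchr($haystack, $needle, false, '8bit');

// Byte-wise equivalent built from the plain string functions.
$viaBytes = substr($haystack, strrpos($haystack, $needle));

var_dump($viaMbstring === $viaBytes);
echo $viaMbstring, PHP_EOL;
```

Both calls are expected to yield the bytes of "éjà", matching the pre-8.1 output from the description.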

I think the only open question here is whether we should make this fail in a different way. Generally mbstring operates on the GIGO principle when it comes to input strings that are incorrectly encoded, because validating them would add significant overhead to all operations. In this case we do already validate the input due to the conversion to UTF-8, so we could always report no match in that case. But we wouldn't be able to guarantee that behavior either, because a future optimization to skip UTF-8 conversion for single-byte encodings would bring back the historical behavior.

Basically, if mb_check_encoding() for an input to an mbstring function returns false, behavior is undefined and is going to change depending on implementation details.
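
Callers who want a predictable result can validate up front. The following is a sketch only: `checked_mb_strrchr()` is a hypothetical userland helper, not an mbstring function, and returning false for invalid input is just one possible policy.

```php
<?php
/**
 * Hypothetical helper: refuse to search when either input is not valid
 * in the given encoding, instead of relying on undefined behavior.
 *
 * @return string|false
 */
function checked_mb_strrchr(string $haystack, string $needle, string $encoding)
{
    if (!mb_check_encoding($haystack, $encoding)
        || !mb_check_encoding($needle, $encoding)) {
        return false; // invalid input: report "no match"
    }

    return mb_strrchr($haystack, $needle, false, $encoding);
}

$haystack = "d\xC3\xA9j\xC3\xA0d\xC3\xA9j\xC3\xA0"; // 'déjàdéjà' as UTF-8 bytes
$needle   = "\xC3\xA9";                             // 'é' as UTF-8 bytes

var_dump(checked_mb_strrchr($haystack, $needle, 'ASCII')); // invalid as ASCII
var_dump(checked_mb_strrchr($haystack, $needle, 'UTF-8')); // valid as UTF-8
```

The extra `mb_check_encoding()` calls cost one more pass over each string, which is the overhead the comment above mentions.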
 [2021-09-20 14:37 UTC] nicolasgrekas@php.net
I'm wondering why mb_strrchr() needs to do any validation, but anyway: works for me, let's close if that's fine with you.
 [2021-09-20 15:02 UTC] nikic@php.net
-Status: Open +Status: Wont fix
 [2021-09-20 15:02 UTC] nikic@php.net
It doesn't need to do validation, it just currently happens to be implemented by converting the string to UTF-8 first, because that's necessary for strings encoded in non-self-synchronizing encodings. ASCII isn't one of those and we could skip it there, but right now we don't...