Doc Bug #80977 mb_ereg_search_pos is not multibyte-safe
Submitted: 2021-04-22 15:08 UTC
Modified: 2021-04-22 16:37 UTC
From: v dot picture at free dot fr
Assigned: cmb
Status: Not a bug
Package: Regexps related
PHP Version: 8.0.3
OS: Linux
Private report: No
CVE-ID: None
 [2021-04-22 15:08 UTC] v dot picture at free dot fr
Description:
------------
Hello,
The mb_ereg_search_pos function is absolutely not multibyte-safe: it returns the position of the match as if the string were not multibyte.

The results of this function are exactly the same as if you were using preg_match_all with PREG_OFFSET_CAPTURE, even with the "unicode" flag it's simply NOT working.

Test script:
---------------
$string = 'jème lé ponés';
mb_ereg_search_init($string, '(?<=[ ^])\w+'); // Detect words
while ($pos = mb_ereg_search_pos()) {
    $match = mb_ereg_search_getregs()[0];
    $matchBasedOnPos = mb_substr($string, $pos[0], $pos[1]);
    if ($matchBasedOnPos !== $match) {
        throw new \LogicException("Match based on position '{$matchBasedOnPos}' does not correspond to actual match '{$match}'");
    }
}

Expected result:
----------------
No exception thrown

Actual result:
--------------
LogicException: Match based on position 'é p' does not correspond to actual match 'lé'

 [2021-04-22 15:13 UTC] cmb@php.net
-Status: Open
+Status: Not a bug
-Type: Bug
+Type: Documentation Problem
-Assigned To:
+Assigned To: cmb
 [2021-04-22 15:13 UTC] cmb@php.net
That is documented[1]:

| An array containing two elements. The first element is the
| offset, in bytes, where the match begins relative to the start of
| the search string, and the second element is the length in bytes
| of the match.

[1] <https://www.php.net/mb_ereg_search_pos>
 [2021-04-22 15:19 UTC] v dot picture at free dot fr
So basically this function returns a position in number of bytes, not in number of characters.
How is this useful to anyone?
Also, if I want to know the actual position - in number of characters - of a match, how do I proceed? I don't think there is any way to do that in PHP today.
 [2021-04-22 15:28 UTC] cmb@php.net
Well, you can use the "regular" string functions on the results of
mb_ereg_search_pos(), e.g. <https://3v4l.org/oK639>.

If you feel the behavior of mb_ereg_search_pos() should be
changed, please pursue the RFC process[1].

[1] <https://wiki.php.net/rfc/howto>
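
A minimal sketch of the approach suggested in this comment, assuming the mbstring extension is built with mb_ereg support and the input is UTF-8. It reuses the string and pattern from the test script and passes the byte offset and byte length to the byte-oriented substr() instead of mb_substr(). The variable names are illustrative, and this is not necessarily what the linked 3v4l snippet contains.

```php
<?php
// Assumption: UTF-8 input; set the regex encoding explicitly to be safe.
mb_regex_encoding('UTF-8');

$string = 'jème lé ponés';
$matches = [];

mb_ereg_search_init($string, '(?<=[ ^])\w+');
while ($pos = mb_ereg_search_pos()) {
    // $pos[0] is the byte offset, $pos[1] the byte length of the match.
    [$byteOffset, $byteLength] = $pos;

    // substr() also counts in bytes, so it slices out exactly the match.
    $matches[] = substr($string, $byteOffset, $byteLength);
}
```

Because both functions work in the same unit (bytes), the extracted text agrees with the match that mb_ereg_search_getregs() reports.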
 [2021-04-22 16:37 UTC] v dot picture at free dot fr
Ok, thanks for your comment.
I would be glad to fix the behavior of this function, unfortunately the underlying code is a little bit too hardcore for me.

In my use-case I can't use substr because I want to compare string positions on multiple lines that may contain (or not) multi-byte characters. I can think of a way to do this by combining substr and mb_strlen to retrieve the actual position of a match but this seems like a very convoluted thing to do.

Maybe there should be a vote about this, whether or not this function should return a byte position or a character position.

In my opinion, even though it's indeed clearly stated in the documentation, it's really weird that I should use an unsafe function to "fix" the behavior of this function: I don't think ANYONE should continue to use multi-byte-unsafe functions in their code.

Also, as mentioned before, I don't expect a multi-byte safe function to return the same output as an unsafe one (preg_match_all w/ PREG_OFFSET_CAPTURE).
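
The substr and mb_strlen combination mentioned above could look roughly like the sketch below. It assumes UTF-8 input; byteSpanToCharSpan() is a hypothetical helper written for illustration and is not part of mbstring. The idea is that the number of characters in the byte prefix before the match is the character offset of the match.

```php
<?php
// Assumption: UTF-8 input throughout.
mb_regex_encoding('UTF-8');

// Hypothetical helper (not an mbstring function): converts a byte
// offset/length pair into a character offset/length pair.
function byteSpanToCharSpan(string $subject, int $byteOffset, int $byteLength): array
{
    // Characters in the prefix before the match = character offset.
    $charOffset = mb_strlen(substr($subject, 0, $byteOffset), 'UTF-8');
    // Characters in the matched bytes = character length.
    $charLength = mb_strlen(substr($subject, $byteOffset, $byteLength), 'UTF-8');

    return [$charOffset, $charLength];
}

$string = 'jème lé ponés';
$spans = [];

mb_ereg_search_init($string, '(?<=[ ^])\w+');
while ($pos = mb_ereg_search_pos()) {
    $spans[] = byteSpanToCharSpan($string, $pos[0], $pos[1]);
}
```

The resulting character spans can be passed to mb_substr() or compared across lines regardless of how many multibyte characters each line contains. The conversion rescans the prefix for every match, so it may be slow on very long strings with many matches.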