php.net |  support |  documentation |  report a bug |  advanced search |  search howto |  statistics |  random bug |  login
Doc Bug #80977 mb_ereg_search_pos is not multibyte-safe
Submitted: 2021-04-22 15:08 UTC Modified: 2021-04-22 16:37 UTC
From: v dot picture at free dot fr Assigned: cmb (profile)
Status: Not a bug Package: Regexps related
PHP Version: 8.0.3 OS: Linux
Private report: No CVE-ID: None
View Developer Edit
Welcome! If you don't have a Git account, you can't do anything here.
If you reported this bug, you can edit this bug over here.
(description)
Block user comment
Status: Assign to:
Package:
Bug Type:
Summary:
From: v dot picture at free dot fr
New email:
PHP Version: OS:

 

 [2021-04-22 15:08 UTC] v dot picture at free dot fr
Description:
------------
Hello,
The mb_ereg_search_pos function is absolutely not multibyte-safe, it actually returns the position of the match as if the string was not multibyte.

The results of this function are exactly the same as if you were using preg_match_all with PREG_OFFSET_CAPTURE, even with the "unicode" flag it's simply NOT working.

Test script:
---------------
$string = 'jème lé ponés';
mb_ereg_search_init($string, '(?<=[ ^])\w+'); // Detect words
while ($pos = mb_ereg_search_pos()) {
    $match = mb_ereg_search_getregs()[0];
    $matchBasedOnPos = mb_substr($string, $pos[0], $pos[1]);
    if ($matchBasedOnPos !== $match) {
        throw new \LogicException("Match based on position '{$matchBasedOnPos}' does not correspond to actual match '{$match}'");
    }
}

Expected result:
----------------
No exception thrown

Actual result:
--------------
LogicException: Match based on position 'é p' does not correspond to actual match 'lé'

Patches

Pull Requests

History

AllCommentsChangesGit/SVN commitsRelated reports
 [2021-04-22 15:13 UTC] cmb@php.net
-Status: Open +Status: Not a bug -Type: Bug +Type: Documentation Problem -Assigned To: +Assigned To: cmb
 [2021-04-22 15:13 UTC] cmb@php.net
That is documented[1]:

| An array containing two elements. The first element is the
| offset, in bytes, where the match begins relative to the start of
| the search string, and the second element is the length in bytes
| of the match.

[1] <https://www.php.net/mb_ereg_search_pos>
 [2021-04-22 15:19 UTC] v dot picture at free dot fr
So basically this function returns a position in number of bytes, not in number of characters.
How is this useful to anyone?
Also, if I want to know the actual position - in number of characters - of a match, how do I proceed? I don't think there is any way to do that in PHP today.
 [2021-04-22 15:28 UTC] cmb@php.net
Well, you can use the "regular" string functions on the results of
mb_ereg_search_pos(), e.g. <https://3v4l.org/oK639>.

If you feel the behavior of mb_ereg_search_pos() should be
changed, please pursue the RFC process[1].

[1] <https://wiki.php.net/rfc/howto>
 [2021-04-22 16:37 UTC] v dot picture at free dot fr
Ok, thanks for you comment.
I would be glad to fix the behavior of this function, unfortunately the underlying code is a little bit too hardcore for me.

In my use-case I can't use substr because I want to compare string positions on multiple lines that may contain (or not) multi-byte characters. I can think of a way to do this by combining substr and mb_strlen to retrieve the actual position of a match but this seems like a very convoluted thing to do.

Maybe there should be a vote about this, whether or not this function should return a byte position or a character position.

In my opinion, even though it's indeed clearly stated in the documentation, it's really weird that I should use an unsafe function to "fix" the behavior of this function: I don't think ANYONE should continue to use multi-byte-unsafe function in their code.

Also, as mentioned before, I don't expect a multi-byte safe function to return the same output as an unsafe one (preg_match_all w/ PREG_OFFSET_CAPTURE).
 
PHP Copyright © 2001-2024 The PHP Group
All rights reserved.
Last updated: Thu Sep 12 02:01:26 2024 UTC