PHP :: Doc Bug #80977 :: mb_ereg_search

Doc Bug #80977	mb_ereg_search_pos is not multibyte-safe
Submitted:	2021-04-22 15:08 UTC	Modified:	2021-04-22 16:37 UTC
From:	v dot picture at free dot fr	Assigned:	cmb (profile)
Status:	Not a bug	Package:	Regexps related
PHP Version:	8.0.3	OS:	Linux
Private report:	No	CVE-ID:	None

[2021-04-22 15:08 UTC] v dot picture at free dot fr

Description:
------------
Hello,
The mb_ereg_search_pos function is not multibyte-safe: it returns the position of the match as if the string were not multibyte.

The results of this function are exactly the same as those of preg_match_all with PREG_OFFSET_CAPTURE; even with the "unicode" flag, it simply does not work.

Test script:
---------------
$string = 'jème lé ponés';
mb_ereg_search_init($string, '(?<=[ ^])\w+'); // Detect words
while ($pos = mb_ereg_search_pos()) {
    $match = mb_ereg_search_getregs()[0];
    $matchBasedOnPos = mb_substr($string, $pos[0], $pos[1]);
    if ($matchBasedOnPos !== $match) {
        throw new \LogicException("Match based on position '{$matchBasedOnPos}' does not correspond to actual match '{$match}'");
    }
}

Expected result:
----------------
No exception thrown

Actual result:
--------------
LogicException: Match based on position 'é p' does not correspond to actual match 'lé'

[2021-04-22 15:13 UTC] cmb@php.net

-Status: Open
+Status: Not a bug
-Type: Bug
+Type: Documentation Problem
-Assigned To:
+Assigned To: cmb

[2021-04-22 15:13 UTC] cmb@php.net

That is documented[1]:

| An array containing two elements. The first element is the
| offset, in bytes, where the match begins relative to the start of
| the search string, and the second element is the length in bytes
| of the match.

[1] <https://www.php.net/mb_ereg_search_pos>

[2021-04-22 15:19 UTC] v dot picture at free dot fr

So basically this function returns a position in number of bytes, not in number of characters.
How is this useful to anyone?
Also, if I want to know the actual position - in number of characters - of a match, how do I proceed? I don't think there is any way to do that in PHP today.

[2021-04-22 15:28 UTC] cmb@php.net

Well, you can use the "regular" string functions on the results of
mb_ereg_search_pos(), e.g. <https://3v4l.org/oK639>.
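
The linked snippet is not reproduced here. As a minimal sketch of the idea, assuming UTF-8 input and the mbstring extension built with mb_ereg_* support, the byte-based pair from mb_ereg_search_pos() can be passed straight to substr(), which also counts in bytes:

```php
<?php
// Sketch only: assumes the source file and $string are UTF-8 encoded.
mb_regex_encoding('UTF-8');

$string  = 'jème lé ponés';
$matches = [];

mb_ereg_search_init($string, '(?<=[ ^])\w+');
while ($pos = mb_ereg_search_pos()) {
    // $pos holds [offset in bytes, length in bytes].
    [$byteOffset, $byteLength] = $pos;
    // substr() works on bytes, so it agrees with the byte-based result.
    $matches[] = substr($string, $byteOffset, $byteLength);
}
```

Because both functions use the same unit, the extracted text lines up with what mb_ereg_search_getregs() reports for the same match.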

If you feel the behavior of mb_ereg_search_pos() should be
changed, please pursue the RFC process[1].

[1] <https://wiki.php.net/rfc/howto>

[2021-04-22 16:37 UTC] v dot picture at free dot fr

OK, thanks for your comment.
I would be glad to fix the behavior of this function; unfortunately, the underlying code is a little too hardcore for me.

In my use case I can't use substr because I want to compare string positions on multiple lines that may or may not contain multi-byte characters. I can think of a way to do this by combining substr and mb_strlen to retrieve the actual position of a match, but this seems very convoluted.
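
For reference, the substr plus mb_strlen combination mentioned above could look roughly like the sketch below. The helper name byteSpanToCharSpan is hypothetical (it is not part of mbstring), valid UTF-8 input is assumed, and the byte offset 6 and byte length 3 are the values implied for the match 'lé' in the test script.

```php
<?php
// Hypothetical helper, not an mbstring function: converts a byte-based
// [offset, length] pair into a character-based pair. Assumes valid UTF-8.
function byteSpanToCharSpan(string $string, int $byteOffset, int $byteLength): array
{
    // Characters before the match = character count of the leading bytes.
    $charOffset = mb_strlen(substr($string, 0, $byteOffset), 'UTF-8');
    // Characters in the match = character count of the matched bytes.
    $charLength = mb_strlen(substr($string, $byteOffset, $byteLength), 'UTF-8');

    return [$charOffset, $charLength];
}

$string = 'jème lé ponés';
// Byte span assumed for the match 'lé' (offset 6, length 3).
[$charOffset, $charLength] = byteSpanToCharSpan($string, 6, 3);
// Character-based values can now be used with mb_substr().
$word = mb_substr($string, $charOffset, $charLength, 'UTF-8');
```

The resulting character offsets can be compared across lines regardless of how many multi-byte characters each line contains.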

Maybe there should be a vote on whether this function should return a byte position or a character position.

In my opinion, even though it's clearly stated in the documentation, it's really weird that I should use an unsafe function to "fix" the behavior of this function: I don't think ANYONE should continue to use multi-byte-unsafe functions in their code.

Also, as mentioned before, I don't expect a multi-byte safe function to return the same output as an unsafe one (preg_match_all w/ PREG_OFFSET_CAPTURE).


Copyright © 2001-2026 The PHP Group All rights reserved.	Last updated: Thu Jun 25 14:00:01 2026 UTC