PHP :: Doc Bug #80166 :: preg_match_all()'s PREG_OFFSET

Doc Bug #80166	preg_match_all()'s PREG_OFFSET_CAPTURE ignoring modifier u
Submitted:	2020-10-01 12:02 UTC
Modified:	2020-10-03 21:19 UTC
From:	thomas at landauer dot at
Assigned:	cmb
Status:	Closed
Package:	PCRE related
PHP Version:	7.2.33
OS:	Linux
Private report:	No
CVE-ID:	None


[2020-10-01 12:02 UTC] thomas at landauer dot at

Description:
------------
PREG_OFFSET_CAPTURE always counts utf-8 special characters as *2* (bytes) - no matter if the modifier `u` (PCRE_UTF8) is present or not.

This has been reported before at https://bugs.php.net/bug.php?id=37391 but got closed as "Not a bug". However, taking a look at the documentation clearly shows that this *is* a bug:

https://www.php.net/manual/en/function.preg-match-all.php#refsect1-function.preg-match-all-parameters says about PREG_OFFSET_CAPTURE

> If this flag is passed, for every occurring match the appendant string offset will also be returned.

"string offset" obviously refers to the number of *characters* in front of the match.

And https://www.php.net/manual/en/reference.pcre.pattern.modifiers.php says about u (PCRE_UTF8):

> Pattern and subject strings are treated as UTF-8.


My actual PHP version is 7.2.24

Test script:
---------------
preg_match_all('/a/u', 'öa', $matches, PREG_OFFSET_CAPTURE);
var_dump($matches);

Expected result:
----------------
[1]=>int(1)

Actual result:
--------------
[1]=>int(2)
(=expected result without modifier u)


[2020-10-01 12:10 UTC] nikic@php.net

-Type: Bug +Type: Documentation Problem

[2020-10-01 12:10 UTC] nikic@php.net

The documentation should be clarified, but the current behavior is both intentional and preferable. The use of character offsets should be avoided wherever possible.

[2020-10-01 12:11 UTC] cmb@php.net

-Status: Open +Status: Verified -Assigned To: +Assigned To: cmb

[2020-10-01 12:14 UTC] cmb@php.net

> PREG_OFFSET_CAPTURE always counts utf-8 special characters as
> *2* (bytes)

To clarify: UTF-8 characters are not necessarily counted as 2
bytes.
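
As an illustration of the point above, here is a minimal sketch (not part of the original report) that runs the reporter's pattern against subjects whose first character takes 2, 3, and 4 bytes in UTF-8. The sample subjects are chosen for illustration only, and the `\u{...}` escapes assume PHP 7.0 or later.

```php
<?php
// Each subject is one non-ASCII character followed by "a".
$subjects = [
    "\u{F6}a",    // U+00F6 (o with diaeresis), 2 bytes in UTF-8
    "\u{20AC}a",  // U+20AC (euro sign), 3 bytes in UTF-8
    "\u{1F600}a", // U+1F600 (emoji), 4 bytes in UTF-8
];

$offsets = [];
foreach ($subjects as $subject) {
    preg_match_all('/a/u', $subject, $matches, PREG_OFFSET_CAPTURE);
    // $matches[0][0] is [matched text, byte offset of the match].
    $offsets[] = $matches[0][0][1];
}

var_dump($offsets); // byte offsets 2, 3 and 4
```

In every case the "a" is the second character, so a character offset would be 1 each time; the reported offset instead equals the byte length of the preceding character.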

[2020-10-01 12:16 UTC] nikic@php.net

Right, the clarification here should be that this is always a "byte offset" including in UTF8 mode.

[2020-10-01 12:20 UTC] phpdocbot@php.net

Automatic comment on behalf of cmb
Revision: http://git.php.net/?p=doc/en.git;a=commit;h=7d4c08228e566afb3caedd29cb837a2fa67fcfbf
Log: Fix #80166: preg_match_all()'s PREG_OFFSET_CAPTURE ignoring modifier u

[2020-10-01 12:20 UTC] phpdocbot@php.net

-Status: Verified +Status: Closed

[2020-10-01 12:31 UTC] thomas at landauer dot at

> The use of character offsets should be avoided wherever possible.

Why?
When using *byte* offsets to split a string, you might end up with an invalid string, due to some "half"-characters.

[2020-10-01 12:38 UTC] nikic@php.net

@thomas: If your pattern matches at a character boundary, then of course the returned byte offset will also always be located at a character boundary.

The only way you could end up with a byte offset that is not on a character boundary is if your pattern explicitly requests that by using a single code unit match (\C). If it does, the result would not even be representable with a character offset.
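
The boundary argument above can be checked with a small sketch. The helper name `isValidUtf8` is hypothetical; it relies on `preg_match()` returning `false` when a subject is not valid UTF-8 under the `u` modifier. The sample subject is an assumption made for illustration.

```php
<?php
// Hypothetical helper: with the u modifier, preg_match() returns false
// (instead of 0 or 1) when the subject is not valid UTF-8.
function isValidUtf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

$subject = "\u{E4}\u{F6}\u{FC}"; // three 2-byte characters, 6 bytes total

// Match the middle character (U+00F6) in UTF-8 mode.
preg_match("/\u{F6}/u", $subject, $m, PREG_OFFSET_CAPTURE);
$byteOffset = $m[0][1]; // 2: the match starts on a character boundary

// Splitting at the returned byte offset keeps both halves valid.
$before = substr($subject, 0, $byteOffset);
$after  = substr($subject, $byteOffset);
$bothValid = isValidUtf8($before) && isValidUtf8($after);

// Splitting at an arbitrary byte offset (here 1) cuts a character in half.
$brokenValid = isValidUtf8(substr($subject, 0, 1));

var_dump($byteOffset, $bothValid, $brokenValid);
```

Only the hand-picked offset 1 produces an invalid fragment; the offset returned by the match does not.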

[2020-10-01 12:58 UTC] thomas at landauer dot at

What I meant: My current code works with character numbers (`mb_substr()`, `mb_strlen()`). If I switched to byte numbers, I'd have to change this to `substr()` and `strlen()`; and if anything goes wrong there, I'm not just extracting some wrong characters, but rather completely *destroying* the entire string...

So why is it preferable to work with byte offsets?

And what's the point in having a dedicated modifier for UTF-8, if it doesn't make a difference in the end?
I think you should support this modifier here too, and leave the decision (byte vs. character offsets) to the user. I mean: This is exactly the point of such a switch, isn't it?
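
For code that already works with character positions, one possible bridge is to convert each returned byte offset into a character offset. The following is only a sketch: `byteToCharOffset` is a hypothetical helper, it assumes the mbstring extension is loaded, and it rescans the prefix of the subject on every call, so it runs in linear time per offset.

```php
<?php
// Hypothetical helper: count the characters in front of a byte offset.
// Assumes $byteOffset lies on a character boundary, which holds for
// offsets returned by a /u pattern that does not use \C.
function byteToCharOffset(string $subject, int $byteOffset): int
{
    return mb_strlen(substr($subject, 0, $byteOffset), 'UTF-8');
}

// U+00F6, "a", U+20AC, "b": 4 characters, 7 bytes.
$subject = "\u{F6}a\u{20AC}b";

preg_match_all('/[ab]/u', $subject, $matches, PREG_OFFSET_CAPTURE);

$charOffsets = [];
foreach ($matches[0] as [$text, $byteOffset]) {
    $charOffsets[] = byteToCharOffset($subject, $byteOffset);
}

var_dump($charOffsets); // character offsets 1 and 3 (byte offsets were 2 and 6)

// The converted offsets work with the mb_* functions:
$firstMatch = mb_substr($subject, $charOffsets[0], 1, 'UTF-8'); // "a"
```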

[2020-10-01 13:11 UTC] cmb@php.net

First, all offsets in the PCRE extension are byte offsets.
Changing that would be a massive BC break.

Second, UTF-8 character offsets enforce sequential access to
characters and substrings, while byte offsets allow random access,
which is way faster.

If you still feel strongly that this should be changed, please
write a mail to the internals mailing list[1], since this
bugtracker is not suitable for this kind of discussion.

[1] <https://www.php.net/mailing-lists.php#internals>

[2020-10-01 13:44 UTC] nikic@php.net

> And what's the point in having a dedicated modifier for UTF-8, if it doesn't make a difference in the end?
> I think you should support this modifier here too, and leave the decision (byte vs. character offsets) to the user. I mean: This is exactly the point of such a switch, isn't it?

The modifier controls how the pattern and input string are interpreted. You want /[äöü]/ to match the characters ä, ö and ü, not their constituent bytes in the UTF-8 encoding. It has no relation at all to the meaning of offsets, which are always proper byte offsets.
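
A short sketch of that distinction, using the character class from the comment above. The subject string is an assumption for illustration; the point is that the `u` modifier changes what is matched, while the offsets stay byte offsets in both cases.

```php
<?php
$subject = "\u{E4}\u{F6}\u{FC}";     // three characters, 2 bytes each
$class   = "[\u{E4}\u{F6}\u{FC}]";   // the same three characters as a class

// With u: the class contains three characters, so there are three matches.
$countUtf8 = preg_match_all("/$class/u", $subject, $mUtf8, PREG_OFFSET_CAPTURE);

// Without u: the class is read as a set of single bytes, so every byte of
// the subject matches on its own.
$countBytes = preg_match_all("/$class/", $subject, $mBytes, PREG_OFFSET_CAPTURE);

$offsetsUtf8  = array_column($mUtf8[0], 1);  // offsets of whole characters
$offsetsBytes = array_column($mBytes[0], 1); // offsets of individual bytes

var_dump($countUtf8, $offsetsUtf8);   // 3 matches at byte offsets 0, 2, 4
var_dump($countBytes, $offsetsBytes); // 6 matches at byte offsets 0 through 5
```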

[2020-10-01 14:25 UTC] phpdocbot@php.net

Automatic comment on behalf of mumumu
Revision: http://git.php.net/?p=doc/ja.git;a=commit;h=326a1d72dcef26f2461a23cf3b4897fab41f3375
Log: Fix #80166: preg_match_all()'s PREG_OFFSET_CAPTURE ignoring modifier u

[2020-10-03 21:19 UTC] thomas at landauer dot at

Here's what I posted to the "php.internals" mailing list: https://news-web.php.net/php.internals/111983

[2020-12-30 11:58 UTC] nikic@php.net

Automatic comment on behalf of mumumu
Revision: http://git.php.net/?p=doc/ja.git;a=commit;h=ae8729d4a4b4a16f4944c5fd8e269c8afac97634
Log: Fix #80166: preg_match_all()'s PREG_OFFSET_CAPTURE ignoring modifier u


