Doc Bug #80166 preg_match_all()'s PREG_OFFSET_CAPTURE ignoring modifier u
Submitted: 2020-10-01 12:02 UTC
Modified: 2020-10-03 21:19 UTC
From: thomas at landauer dot at
Assigned: cmb
Status: Closed
Package: PCRE related
PHP Version: 7.2.33
OS: Linux
Private report: No
CVE-ID: None
 [2020-10-01 12:02 UTC] thomas at landauer dot at
Description:
------------
PREG_OFFSET_CAPTURE always counts utf-8 special characters as *2* (bytes) - no matter if the modifier `u` (PCRE_UTF8) is present or not.

This has been reported before at https://bugs.php.net/bug.php?id=37391 but got closed as "Not a bug". However, taking a look at the documentation clearly shows that this *is* a bug:

https://www.php.net/manual/en/function.preg-match-all.php#refsect1-function.preg-match-all-parameters says about PREG_OFFSET_CAPTURE

> If this flag is passed, for every occurring match the appendant string offset will also be returned.

"string offset" obviously refers to the number of *characters* in front of the match.

And https://www.php.net/manual/en/reference.pcre.pattern.modifiers.php says about u (PCRE_UTF8):

> Pattern and subject strings are treated as UTF-8.


My actual PHP version is 7.2.24

Test script:
---------------
preg_match_all('/a/u', 'öa', $matches, PREG_OFFSET_CAPTURE);
var_dump($matches);

Expected result:
----------------
[1]=>int(1)

Actual result:
--------------
[1]=>int(2)
(=expected result without modifier u)
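
A self-contained variant of the test script may make the result easier to follow. This is only a sketch: it assumes the source file is saved as UTF-8, and the variable names `$subject`, `$text` and `$offset` are illustrative, not part of the original report.

```php
<?php
// Assumes this file is saved as UTF-8, so 'ö' is stored as the two bytes C3 B6.
$subject = 'öa';
preg_match_all('/a/u', $subject, $matches, PREG_OFFSET_CAPTURE);

// With PREG_OFFSET_CAPTURE, each match is a pair: [matched text, offset].
[$text, $offset] = $matches[0][0];

var_dump($text);        // string(1) "a"
var_dump($offset);      // int(2): 'a' starts after the two bytes of 'ö'
var_dump(strlen('ö'));  // int(2)
```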

 [2020-10-01 12:10 UTC] nikic@php.net
-Type: Bug +Type: Documentation Problem
 [2020-10-01 12:10 UTC] nikic@php.net
The documentation should be clarified, but the current behavior is both intentional and preferable. The use of character offsets should be avoided wherever possible.
 [2020-10-01 12:11 UTC] cmb@php.net
-Status: Open +Status: Verified -Assigned To: +Assigned To: cmb
 [2020-10-01 12:14 UTC] cmb@php.net
> PREG_OFFSET_CAPTURE always counts utf-8 special characters as
> *2* (bytes)

To clarify: UTF-8 characters are not necessarily counted as 2
bytes.
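
As a rough illustration of this point, the sketch below prints the encoded length of a few characters and shows how the reported offset grows with the width of the preceding character. It assumes a UTF-8 encoded source file; the sample characters and variable names are chosen only for the example.

```php
<?php
// Assumes this file is saved as UTF-8.
$samples = ['a', 'ö', '€', '😀'];

// strlen() counts bytes, so this gives the encoded width of each character.
$byteLengths = array_map('strlen', $samples);
print_r($byteLengths); // 1, 2, 3 and 4 bytes respectively

// The offset of 'a' is the byte width of whatever precedes it.
preg_match('/a/u', '€a', $m, PREG_OFFSET_CAPTURE);
$offsetAfterEuro = $m[0][1];
var_dump($offsetAfterEuro); // int(3)
```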
 [2020-10-01 12:16 UTC] nikic@php.net
Right, the clarification here should be that this is always a "byte offset", including in UTF-8 mode.
 [2020-10-01 12:20 UTC] phpdocbot@php.net
Automatic comment on behalf of cmb
Revision: http://git.php.net/?p=doc/en.git;a=commit;h=7d4c08228e566afb3caedd29cb837a2fa67fcfbf
Log: Fix #80166: preg_match_all()'s PREG_OFFSET_CAPTURE ignoring modifier u
 [2020-10-01 12:20 UTC] phpdocbot@php.net
-Status: Verified +Status: Closed
 [2020-10-01 12:31 UTC] thomas at landauer dot at
> The use of character offsets should be avoided wherever possible.

Why?
When using *byte* offsets to split a string, you might end up with an invalid string, due to some "half"-characters.
 [2020-10-01 12:38 UTC] nikic@php.net
@thomas: If your pattern matches at a character boundary, then of course the returned byte offset will also always be located at a character boundary.

The only way you could end up with a byte offset that is not on a character boundary is if your pattern explicitly requests that by using a single code unit match (\C). If it does, the result would not even be representable with a character offset.
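
The boundary claim can be checked with a small sketch. It assumes a UTF-8 encoded source file and uses `preg_match('//u', ...)` as a validity check, which returns 1 for well-formed UTF-8 and false otherwise; the sample string and variable names are illustrative.

```php
<?php
// Assumes this file is saved as UTF-8.
$subject = 'äöü€x';

// Match every character and collect the byte offset of each match.
preg_match_all('/./u', $subject, $matches, PREG_OFFSET_CAPTURE);
$offsets = array_column($matches[0], 1);
print_r($offsets); // 0, 2, 4, 6, 9

// Splitting at any returned offset leaves two well-formed UTF-8 strings.
$allValid = true;
foreach ($offsets as $o) {
    $head = substr($subject, 0, $o);
    $tail = substr($subject, $o);
    if (preg_match('//u', $head) !== 1 || preg_match('//u', $tail) !== 1) {
        $allValid = false;
    }
}
var_dump($allValid); // bool(true)

// For contrast: byte 1 lies inside 'ä', so cutting there is not valid UTF-8.
$broken = preg_match('//u', substr($subject, 1));
var_dump($broken); // bool(false)
```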
 [2020-10-01 12:58 UTC] thomas at landauer dot at
What I meant: My current code works with character numbers (`mb_substr()`, `mb_strlen()`). If I switched to byte numbers, I'd have to change this to `substr()` and `strlen()`; and if anything goes wrong there, I'm not just extracting some wrong characters, but rather completely *destroying* the entire string...

So why is it preferable to work with byte offsets?

And what's the point in having a dedicated modifier for UTF-8, if it doesn't make a difference in the end?
I think you should support this modifier here too, and leave the decision (byte vs. character offsets) to the user. I mean: This is exactly the point of such a switch, isn't it?
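
For code that has to keep working with character numbers, one possible workaround is to convert the returned byte offset by counting the characters in front of it. The helper `byteToCharOffset()` below is hypothetical and only a sketch; it assumes the subject is valid UTF-8 and a UTF-8 encoded source file. With the mbstring extension, `mb_strlen(substr($subject, 0, $byteOffset), 'UTF-8')` would do the same counting.

```php
<?php
// Hypothetical helper: number of UTF-8 characters before a given byte offset.
function byteToCharOffset(string $subject, int $byteOffset): int
{
    // Count the characters in the prefix that ends at the byte offset.
    $count = preg_match_all('/./su', substr($subject, 0, $byteOffset));
    if ($count === false) {
        throw new InvalidArgumentException('Prefix is not valid UTF-8');
    }
    return $count;
}

// Assumes this file is saved as UTF-8.
$subject = 'Grüße, wörld';
preg_match('/w/u', $subject, $m, PREG_OFFSET_CAPTURE);

$byteOffset = $m[0][1];
$charOffset = byteToCharOffset($subject, $byteOffset);

var_dump($byteOffset); // int(9): 'ü' and 'ß' take two bytes each
var_dump($charOffset); // int(7)
```

The conversion scans the whole prefix, so it costs time proportional to the offset on every call.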
 [2020-10-01 13:11 UTC] cmb@php.net
First, all offsets in the PCRE extension are byte offsets.
Changing that would be a massive BC break.

Second, UTF-8 character offsets enforce sequential access to
characters and substrings, while byte offsets allow random access,
which is way faster.

If you still feel strongly that this should be changed, please
write a mail to the internals mailing list[1], since this
bugtracker is not suitable for this kind of discussion.

[1] <https://www.php.net/mailing-lists.php#internals>
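
The second point can be sketched as follows. The helper `charToByteOffset()` is hypothetical and is not how PCRE or mbstring are implemented; it only illustrates that locating the n-th character in a UTF-8 string means walking the bytes from the start, while a byte offset can be used directly. It assumes valid UTF-8 input and a UTF-8 encoded source file.

```php
<?php
// Hypothetical helper: byte offset of the character at index $charIndex.
function charToByteOffset(string $s, int $charIndex): int
{
    $len = strlen($s);
    $chars = 0;
    for ($i = 0; $i < $len; $i++) {
        // Bytes of the form 10xxxxxx continue a character; all others start one.
        if ((ord($s[$i]) & 0xC0) !== 0x80) {
            if ($chars === $charIndex) {
                return $i;
            }
            $chars++;
        }
    }
    if ($chars === $charIndex) {
        return $len; // position just past the last character
    }
    throw new OutOfRangeException('Character index is beyond the end of the string');
}

// Assumes this file is saved as UTF-8.
$s = 'äöü€x';

// Character index 3 ('€') is only found after scanning the six bytes before it.
var_dump(charToByteOffset($s, 3)); // int(6)

// A byte offset needs no scan: the byte is addressed directly.
var_dump($s[9]); // string(1) "x"
```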
 [2020-10-01 13:44 UTC] nikic@php.net
> And what's the point in having a dedicated modifier for UTF-8, if it doesn't make a difference in the end?
> I think you should support this modifier here too, and leave the decision (byte vs. character offsets) to the user. I mean: This is exactly the point of such a switch, isn't it?

The modifier controls how the pattern and input string are interpreted. You want /[äöü]/ to match the characters ä, ö and ü, not their constituent bytes in the UTF-8 encoding. It has no relation at all to the meaning of offsets, which are always proper byte offsets.
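
A short sketch of this distinction, assuming a UTF-8 encoded source file (variable names are illustrative): without `u` the character class is built from the individual bytes of ä, ö and ü, so it matches each byte of 'ö' separately; with `u` it matches the character once. The offset of a match is the same in both modes.

```php
<?php
// Assumes this file is saved as UTF-8, so 'ö' is the two bytes C3 B6.
$subject = 'ö';

// Without /u the class contains the bytes C3, A4, B6 and BC.
$withoutU = preg_match_all('/[äöü]/', $subject, $m1);
var_dump($withoutU);           // int(2): one match per byte
var_dump(bin2hex($m1[0][0]));  // string(2) "c3"

// With /u the class contains the three characters ä, ö and ü.
$withU = preg_match_all('/[äöü]/u', $subject, $m2);
var_dump($withU);              // int(1)
var_dump($m2[0][0] === 'ö');   // bool(true)

// The modifier does not change the unit of the offsets.
preg_match('/a/', 'öa', $plain, PREG_OFFSET_CAPTURE);
preg_match('/a/u', 'öa', $utf8, PREG_OFFSET_CAPTURE);
var_dump($plain[0][1], $utf8[0][1]); // int(2) int(2)
```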
 [2020-10-01 14:25 UTC] phpdocbot@php.net
Automatic comment on behalf of mumumu
Revision: http://git.php.net/?p=doc/ja.git;a=commit;h=326a1d72dcef26f2461a23cf3b4897fab41f3375
Log: Fix #80166: preg_match_all()'s PREG_OFFSET_CAPTURE ignoring modifier u
 [2020-10-03 21:19 UTC] thomas at landauer dot at
Here's what I posted to the "php.internals" mailing list: https://news-web.php.net/php.internals/111983
 
Last updated: Mon Nov 23 20:01:23 2020 UTC