php.net |  support |  documentation |  report a bug |  advanced search |  search howto |  statistics |  random bug |  login
Bug #73948 Preg_match_all should return NULLs on trailing optional capture groups.
Submitted: 2017-01-16 14:24 UTC Modified: 2019-03-19 14:39 UTC
Votes:3
Avg. Score:4.7 ± 0.5
Reproduced:2 of 2 (100.0%)
Same Version:1 (50.0%)
Same OS:0 (0.0%)
From: tomasyorke at hotmail dot com Assigned: nikic (profile)
Status: Closed Package: PCRE related
PHP Version: 7.0.14 OS: Windows 7
Private report: No CVE-ID: None
Welcome back! If you're the original bug submitter, here's where you can edit the bug or add additional notes.
If you forgot your password, you can retrieve your password here.
Password:
Status:
Package:
Bug Type:
Summary:
From: tomasyorke at hotmail dot com
New email:
PHP Version: OS:

 

 [2017-01-16 14:24 UTC] tomasyorke at hotmail dot com
Description:
------------
Using preg_match_all with the PREG_SET_ORDER flag and an optional capture group might return either an empty string or a missing element. 

This depends on whether there is a matched capture group after the non-matched capture group.

From an interface perspective, this means that I have to check for two representations to test whether an optional capture group was matched.

The test script demonstrates such a case.

When considering a fix, the first priority should be that whatever the representation is (NULL, an empty string or a missing element.),  it should be consistent, no matter if there is a matched capture group after or not.

If this cannot be fixed due to backwards compatibility Issues. Could we add a PREG_KEEP_NONMATCHES or PREG_SET_ORDER_2 flag?


Test script:
---------------
<?php

preg_match_all("#(a)?(b)(c)?#","b",$matches,PREG_SET_ORDER);

var_dump($matches);

Expected result:
----------------
array(1) {
  [0] => array(3) {
    [0] => string(1) "b" [1] => string(0) "" [2] => string(1) "b" [3] => string(0) ""
  }
}

Actual result:
--------------
array(1) {
  [0] => array(3) {
    [0] => string(1) "b" [1] => string(0) "" [2] => string(1) "b"
  }
}

Patches

Pull Requests

History

AllCommentsChangesGit/SVN commitsRelated reports
 [2017-01-16 14:38 UTC] requinix@php.net
-Status: Open +Status: Duplicate -Package: *Regular Expressions +Package: PCRE related
 [2017-01-16 14:38 UTC] requinix@php.net
Duplicate of bug #61780, fixed in master. (unmatched groups will be NULL)
 [2017-01-16 14:40 UTC] nikic@php.net
Not sure if it's quite a duplicate. IIRC we now use null instead of "" if unmatched, but we still don't fill up null values at the end.
 [2017-01-16 14:57 UTC] requinix@php.net
-Status: Duplicate +Status: Feedback
 [2017-01-16 14:57 UTC] requinix@php.net
Hmm, yes, I misunderstood this report: 61780 is about turning empty strings from unmatched subpatterns into NULLs while this is about *adding* trailing unmatched subpatterns.
I don't see anything in the 61780 changes that clearly indicate this is also fixed, but some parts look like they might do it accidentally. Can anyone check? (It'd be nice if 3v4l could do master too...)
 [2017-01-16 15:50 UTC] requinix@php.net
-Summary: Preg_match_all should return empty strings on optional capture groups. +Summary: Preg_match_all should return NULLs on trailing optional capture groups. -Status: Feedback +Status: Open
 [2017-01-16 15:50 UTC] requinix@php.net
Actually I can. Yay for Linux on Windows!

~/php/master/bin# ./php
<?php var_dump(preg_match('/(a)?(b)?(c)?/', 'b', $matches), $matches);
int(1)
array(3) {
  [0]=>
  string(1) "b"
  [1]=>
  NULL
  [2]=>
  string(1) "b"
}

[1] is clearly NULL but there is no [3] for the unmatched (c)? group.
 [2017-01-17 13:28 UTC] cmb@php.net
-Status: Open +Status: Analyzed
 [2017-01-17 13:28 UTC] cmb@php.net
For PREG_PATTERN_ORDER unmatched subpatterns are already added[1];
something like that would also have to be done for PREG_SET_ORDER.

However, that would cause BC issues, and adding another flag isn't
appealing to me. I also think that NULL and unset are close enough
to stick with the current behavior for now; checking `isset()`
would suffice.

[1] <https://github.com/php/php-src/blob/PHP-7.1.0/ext/pcre/php_pcre.c#L862-L871>
 [2017-01-23 15:48 UTC] tomasyorke at hotmail dot com
As noted by cmb, this bug does not appear with the PREG_PATTERN_ORDER flag.
Thus I switched the flag and my code to use PREG_PATTERN_ORDER and implemented my own preg_set_order() function.
But I found a nearly identical bug with the PREG_OFFSET_CAPTURE flag.

Consider the code

  preg_match_all("#(a)?(b)(c)?#","b",$matches,PREG_OFFSET_CAPTURE);
  var_dump($matches);

Returns

  array(4) {
    [0] => array(1) {
      [0] => array(2) {
        [0] => string(1) "b" [1] => int(0)
      }
    } [1] => array(1) {
      [0] => array(2) {
        [0] => string(0) "" [1] => int(-1)
      }
    } [2] => array(1) {
      [0] => array(2) {
        [0] => string(1) "b" [1] => int(0)
      }
    } [3] => array(1) {
      [0] => string(0) ""
    }
  }

As you can see, the third optional capture group returns something different than the first optional capture group.
This is too a bug. Again, I don't care what representation is chosen. But it must be the same for any optional capture no matter if trailing or not.
 [2017-01-23 15:59 UTC] tomasyorke at hotmail dot com
Regarding BC
The proposed change would be to change the representation of missed capture groups  from 2 representations to 1 representation.
The only code that would break is code that supported only the more obscure representation that only appears with trailing groups. And if that code exists, then it will still break on non trailing missing groups even without the change.
 [2017-01-23 17:17 UTC] cmb@php.net
Regarding the BC break: when fixing <https://bugs.php.net/61780> I
would have preferred to unset unmatched captures, but that
appeared too much of a BC break (consider somebody is count()ing
the $matches). Therefore I changed the empty strings to NULL.

Adding unmatched trailing captures as NULL would still have the
same BC concerns – I don't think we can do that before PHP 7.2,
and even that might require the RFC process.
 [2017-01-23 17:43 UTC] tomasyorke at hotmail dot com
I'm glad we didn't go the unset route. Having fixed sized arrays is useful for looping by groups/matches independently of the Flag used.
You also can very easily know how many groups/matches the regex returned/contained.
By making the output array dynamicly sized. You only allow for an easy way to determine one of these things. While making the other more complex.
 [2017-01-23 17:53 UTC] tomasyorke at hotmail dot com
If you wish to delay this change until 7.2, that sounds fine.
But why would you need an RFC for a bugfix?
 [2017-01-23 18:31 UTC] cmb@php.net
> But why would you need an RFC for a bugfix?

An RFC might be over the top, but at least a PR to get some
attention appears to be appropriate.
 [2018-04-14 14:42 UTC] cmb@php.net
Since we now have PREG_UNMATCHED_AS_NULL, the solution appears to
be straight forward: fill up NULL values at the end, only if
PREG_UNMATCHED_AS_NULL is used.  Since this option is only
available as of PHP 7.2.0, the resulting BC break seems to be
acceptable.
 [2018-04-14 14:47 UTC] requinix@php.net
The filling should happen with or without PREG_UNMATCHED_AS_NULL. What it should control whether the filler is NULLs or empty strings.
 [2018-04-14 15:08 UTC] nikic@php.net
@requinix: That's impossible for BC reasons. I agree with @cmb that this should only be done for PREG_UNMATCHED_AS_NULL.
 [2018-06-09 22:51 UTC] php at tecnopolis dot ca
For those stuck on older PHP's, I'm wondering if the following is sufficient to work around this bug?

$pattern.='(.??)';
preg_match($pattern,$haystack,$matches);
array_pop($matches);

I can't yet find any instance where the fairly innocuous (.??) could possibly cause a problem, it should match in all cases, when there's chars left or not, and being non-greedy will not eat earlier matches' chars.
 [2019-03-19 13:04 UTC] nikic@php.net
-Status: Analyzed +Status: Assigned -Assigned To: +Assigned To: nikic
 [2019-03-19 13:04 UTC] nikic@php.net
PR for PHP 7.4: https://github.com/php/php-src/pull/3964
 [2019-03-19 14:39 UTC] nikic@php.net
> But I found a nearly identical bug with the PREG_OFFSET_CAPTURE flag.

Fixed this issue with https://github.com/php/php-src/commit/f53e7394eb61085c09928b47f84123bd90b64dfb on the PHP 7.4 branch.
 [2019-03-21 09:23 UTC] nikic@php.net
Automatic comment on behalf of nikita.ppv@gmail.com
Revision: http://git.php.net/?p=php-src.git;a=commit;h=6311581ac64372164de7ba24f086eb3b0b91eabb
Log: Fix bug #73948
 [2019-03-21 09:23 UTC] nikic@php.net
-Status: Assigned +Status: Closed
 
PHP Copyright © 2001-2024 The PHP Group
All rights reserved.
Last updated: Thu Nov 21 13:01:29 2024 UTC