php.net |  support |  documentation |  report a bug |  advanced search |  search howto |  statistics |  random bug |  login
Bug #61780 Inconsistent PCRE captures in match results
Submitted: 2012-04-20 00:54 UTC Modified: 2017-01-06 10:28 UTC
Votes:2
Avg. Score:5.0 ± 0.0
Reproduced:2 of 2 (100.0%)
Same Version:0 (0.0%)
Same OS:0 (0.0%)
From: danielklein at airpost dot net Assigned: cmb (profile)
Status: Closed Package: PCRE related
PHP Version: 5.6.9 OS: *
Private report: No CVE-ID: None
View Developer Edit
Welcome! If you don't have a Git account, you can't do anything here.
If you reported this bug, you can edit this bug over here.
(description)
Block user comment
Status: Assign to:
Package:
Bug Type:
Summary:
From: danielklein at airpost dot net
New email:
PHP Version: OS:

 

 [2012-04-20 00:54 UTC] danielklein at airpost dot net
Description:
------------
Named and unnamed captures in both preg_match and preg_match_all (and probably preg_replace and the other PCRE functions too but I haven't tested them all) can capture the wrong number of parentheses if alternation or a zero-or-more quantifier is used.

If the pattern '/(?<b>b)|(?<c>c)|(?<d>d)/' is used to match 'c', both 'b' and 'c' will be set in the results array but 'd' won't be. 'b' should not be set (even to an empty string) as it failed to match anything. However, if it was trying to match '/(?<b>b?)(?<c>c)/' (note: optional 'b' AND mandatory 'c'), 'b' _should_ be set to '' as it's allowed to match a zero-length string. If a match gets tried but it fails and a capture later in the pattern works, the skipped capture should never produce a key in the results array. It should be OK to leave holes in the numbered sequence (e.g. match 0 and 2 but not 1).

Currently, you need to use PREG_OFFSET_CAPTURE and test to see if the key exists, and if it does, test to see if the capture position is -1. If this bug is fixed, capture positions will never be -1 as the key won't exist. Alternatively, an additional flag could be added (e.g. PREG_KEEP_NONMATCHES) to create keys for ALL captures whether used or not (so, in the first pattern above, keys would be created for 'b', 'c' and 'd' in all cases, and if matching the string 'c' the offsets for both 'b' and 'd' would be -1).

In summary, if the pattern '/(?<b>b)|(?<c>c)|(?<d>d)/' is used to match 'c', by default it should only ever create a key for 'c'. If desired, an additional flag could be added so that it creates keys for all captures: 'b', 'c' and 'd'. The current behaviour where it creates a key for 'b' and 'c' but not 'd' should be considered a bug and fixed.

Test script:
---------------
print('<pre>');
$offset = 0;
while (preg_match('/(?:(?<b>b)|(?<c>c)|(?<d>d))(?<e>e)?/', 'cdec', $matches, PREG_OFFSET_CAPTURE, $offset)) {
  $offset = $matches[0][1] + strlen($matches[0][0]);
  var_export($matches);
  print("\n\n");
}

print("****************\n\n");

preg_match_all('/(?:(?<b>b)|(?<c>c)|(?<d>d))(?<e>e)?/', 'cdec', $matches, PREG_OFFSET_CAPTURE | PREG_SET_ORDER);
var_export($matches);
print('</pre>');


Expected result:
----------------
array (
  0 => 
  array (
    0 => 'c',
    1 => 0,
  ),
  'c' => 
  array (
    0 => 'c',
    1 => 0,
  ),
  2 => 
  array (
    0 => 'c',
    1 => 0,
  ),
)

array (
  0 => 
  array (
    0 => 'de',
    1 => 1,
  ),
  'd' => 
  array (
    0 => 'd',
    1 => 1,
  ),
  3 => 
  array (
    0 => 'd',
    1 => 1,
  ),
  'e' => 
  array (
    0 => 'e',
    1 => 2,
  ),
  4 => 
  array (
    0 => 'e',
    1 => 2,
  ),
)

array (
  0 => 
  array (
    0 => 'c',
    1 => 3,
  ),
  'c' => 
  array (
    0 => 'c',
    1 => 3,
  ),
  2 => 
  array (
    0 => 'c',
    1 => 3,
  ),
)

****************

array (
  0 => 
  array (
    0 => 
    array (
      0 => 'c',
      1 => 0,
    ),
    'c' => 
    array (
      0 => 'c',
      1 => 0,
    ),
    2 => 
    array (
      0 => 'c',
      1 => 0,
    ),
  ),
  1 => 
  array (
    0 => 
    array (
      0 => 'de',
      1 => 1,
    ),
    'd' => 
    array (
      0 => 'd',
      1 => 1,
    ),
    3 => 
    array (
      0 => 'd',
      1 => 1,
    ),
    'e' => 
    array (
      0 => 'e',
      1 => 2,
    ),
    4 => 
    array (
      0 => 'e',
      1 => 2,
    ),
  ),
  2 => 
  array (
    0 => 
    array (
      0 => 'c',
      1 => 3,
    ),
    'c' => 
    array (
      0 => 'c',
      1 => 3,
    ),
    2 => 
    array (
      0 => 'c',
      1 => 3,
    ),
  ),
)

Actual result:
--------------
array (
  0 => 
  array (
    0 => 'c',
    1 => 0,
  ),
  'b' => 
  array (
    0 => '',
    1 => -1,
  ),
  1 => 
  array (
    0 => '',
    1 => -1,
  ),
  'c' => 
  array (
    0 => 'c',
    1 => 0,
  ),
  2 => 
  array (
    0 => 'c',
    1 => 0,
  ),
)

array (
  0 => 
  array (
    0 => 'de',
    1 => 1,
  ),
  'b' => 
  array (
    0 => '',
    1 => -1,
  ),
  1 => 
  array (
    0 => '',
    1 => -1,
  ),
  'c' => 
  array (
    0 => '',
    1 => -1,
  ),
  2 => 
  array (
    0 => '',
    1 => -1,
  ),
  'd' => 
  array (
    0 => 'd',
    1 => 1,
  ),
  3 => 
  array (
    0 => 'd',
    1 => 1,
  ),
  'e' => 
  array (
    0 => 'e',
    1 => 2,
  ),
  4 => 
  array (
    0 => 'e',
    1 => 2,
  ),
)

array (
  0 => 
  array (
    0 => 'c',
    1 => 3,
  ),
  'b' => 
  array (
    0 => '',
    1 => -1,
  ),
  1 => 
  array (
    0 => '',
    1 => -1,
  ),
  'c' => 
  array (
    0 => 'c',
    1 => 3,
  ),
  2 => 
  array (
    0 => 'c',
    1 => 3,
  ),
)

****************

array (
  0 => 
  array (
    0 => 
    array (
      0 => 'c',
      1 => 0,
    ),
    'b' => 
    array (
      0 => '',
      1 => -1,
    ),
    1 => 
    array (
      0 => '',
      1 => -1,
    ),
    'c' => 
    array (
      0 => 'c',
      1 => 0,
    ),
    2 => 
    array (
      0 => 'c',
      1 => 0,
    ),
  ),
  1 => 
  array (
    0 => 
    array (
      0 => 'de',
      1 => 1,
    ),
    'b' => 
    array (
      0 => '',
      1 => -1,
    ),
    1 => 
    array (
      0 => '',
      1 => -1,
    ),
    'c' => 
    array (
      0 => '',
      1 => -1,
    ),
    2 => 
    array (
      0 => '',
      1 => -1,
    ),
    'd' => 
    array (
      0 => 'd',
      1 => 1,
    ),
    3 => 
    array (
      0 => 'd',
      1 => 1,
    ),
    'e' => 
    array (
      0 => 'e',
      1 => 2,
    ),
    4 => 
    array (
      0 => 'e',
      1 => 2,
    ),
  ),
  2 => 
  array (
    0 => 
    array (
      0 => 'c',
      1 => 3,
    ),
    'b' => 
    array (
      0 => '',
      1 => -1,
    ),
    1 => 
    array (
      0 => '',
      1 => -1,
    ),
    'c' => 
    array (
      0 => 'c',
      1 => 3,
    ),
    2 => 
    array (
      0 => 'c',
      1 => 3,
    ),
  ),
)

Patches

Pull Requests

Pull requests:

History

AllCommentsChangesGit/SVN commitsRelated reports
 [2013-02-11 08:56 UTC] michael at mbaas dot de
Here is a reproduceable example (PHP 5.3.20 and 5.3.21) where named captures do 
not return matches at all! I've tested this pattern against the PCRE-
Implementation in another language and it worked...

<?php


$QQ=chr(92) . chr(34);
$delimeters = "{}";
$del0 = preg_quote($delimeters{0});
$del1 = preg_quote($delimeters{1});
$tag="language";
$string="fdfdfdfdf{language=1}testhgg";
$preg = "~" . $del0 . $tag . "\s*=\s*(?P<" . "quote>[" . $QQ . "\']*)(?
P<att>.*?)(?P=quote)\s*/" . $del1 . "~";
$match=array();
preg_match($preg,$string,$match);
echo "<br>string = " . htmlspecialchars($string) . "<br>preg=" . 
htmlspecialchars($preg) . "<br>match:<pre>";var_dump($match);echo"</pre>";


?>
 [2013-02-11 09:00 UTC] pajoye@php.net
Be sure that every testing environment uses the exact same version of PCRE. PHP 
only uses it and does not affect by any mean the parsing.
 [2013-02-11 09:00 UTC] pajoye@php.net
-Status: Open +Status: Feedback
 [2013-10-15 11:54 UTC] php-bugs at lists dot php dot net
No feedback was provided. The bug is being suspended because
we assume that you are no longer experiencing the problem.
If this is not the case and you are able to provide the
information that was requested earlier, please do so and
change the status of the bug back to "Re-Opened". Thank you.
 [2013-11-04 00:44 UTC] danielklein at airpost dot net
@michael: Your code has a bug in it. There is an extra slash "/" in your pattern. The code works as expected if you remove it.

@pajoye: This bug is still happening. It could be a bug in PCRE itself but I don't know how PHP gets its results from it.

By the way, I can't set this bug back to Re-Opened. It says: "ERROR:
You aren't allowed to change a bug to that state."
 [2015-05-23 08:42 UTC] danielklein at airpost dot net
-Status: No Feedback +Status: Open
 [2015-05-23 08:42 UTC] danielklein at airpost dot net
Output from pcretest:

  re> /(4)?(2)?\d/g
data> 123456
 0: 1
 0: 23
 1: <unset>
 2: 2
 0: 45
 1: 4
 0: 6
data>

Equivalent PHP code:
<?php
preg_match_all('/(4)?(2)?\d/', '123456', $matches, PREG_SET_ORDER);
var_export($matches);

/* Outputs:
array (
  0 => 
  array (
    0 => '1',
  ),
  1 => 
  array (
    0 => '23',
    1 => '',    // This key should not exist as it is unset, not blank in pcretest
    2 => '2',
  ),
  2 => 
  array (
    0 => '45',
    1 => '4',
  ),
  3 => 
  array (
    0 => '6',
  ),
)
*/
?>

pcretest for original report:
  re> /(?:(?<b>b)|(?<c>c)|(?<d>d))(?<e>e)?/
data> cdec
 0: c
 1: <unset>
 2: c
data>
  re> /(?:(?<b>b)|(?<c>c)|(?<d>d))(?<e>e)?/g
data> cdec
 0: c
 1: <unset>
 2: c
 0: de
 1: <unset>
 2: <unset>
 3: d
 4: e
 0: c
 1: <unset>
 2: c
data>
 2: c
data

Showing the difference between a blank capture and no capture (unset):
  re> /(a)?(b?)(c)?/
data> a
 0: a
 1: a
 2:
data> b
 0: b
 1: <unset>
 2: b
data> c
 0: c
 1: <unset>
 2:
 3: c
data>
 2:
 3: c
data
 [2015-05-23 12:12 UTC] cmb@php.net
-Status: Open +Status: Analyzed -Operating System: +Operating System: * -PHP Version: 5.4.0 +PHP Version: 5.6.9 -Assigned To: +Assigned To: cmb
 [2015-05-23 12:12 UTC] cmb@php.net
Verified: <http://3v4l.org/Qk9Pm>.

From the PCRE manuals[1]:

> When any of these functions encounter a substring that is unset,
> which can happen when capturing subpattern number n+1 matches
> some part of the subject, but subpattern n has not been used at
> all, they return an empty string. This can be distinguished from
> a genuine zero-length substring by inspecting the appropriate
> offset in ovector, which is negative for unset substrings.

This disambigution, however, is not done by PHP's implementation.
Instead the offsets (there are two for each subpattern: start and
end) are always subtracted, what yields 0 for unset offsets, and
so a string of length 0 (i.e. an empty string) is appended to the
resulting array.

Fixing the behavior to use NULL instead of an empty string doesn't
seem to be hard, but the change would constitute a BC break
(existing code actually might rely on a empty string). It might be
okay, however, to change the behavior for PHP 7.0.0.

[1] <http://www.pcre.org/original/doc/html/pcreapi.html#SEC18>
 [2015-05-23 18:56 UTC] cmb@php.net
-Assigned To: cmb +Assigned To:
 [2015-05-23 18:56 UTC] cmb@php.net
I have submitted a PR, where unmatched subpatterns will yield NULL
instead of an empty string. Besides being the smaller BC break
compared to skip the respective keys, that would be inline with
the results of pcretest, and in case of named subpatterns would
still allow to check with isset.
 [2017-01-06 10:28 UTC] cmb@php.net
-Status: Analyzed +Status: Closed -Assigned To: +Assigned To: cmb
 
PHP Copyright © 2001-2024 The PHP Group
All rights reserved.
Last updated: Thu Nov 21 10:01:29 2024 UTC