Bug #53309 Capturing group failing with a colon
Submitted: 2010-11-14 16:16 UTC
Modified: 2010-11-15 02:02 UTC
From: michael at squiloople dot com
Assigned:
Status: Not a bug
Package: PCRE related
PHP Version: 5.3.3
OS: Vista
Private report: No
CVE-ID: None
 [2010-11-14 16:16 UTC] michael at squiloople dot com
Description:
------------
In some circumstances, a match unexpectedly fails when a colon is one of the
characters specified in a capturing group.

Test script:
---------------
var_dump(preg_match('/^(([a-z])(?::(?2))*)::(?:(?1):)[a-z]$/', 'a::a:a'));
var_dump(preg_match('/^(([a-z])(?::(?2))*)::(?:(?1)-)[a-z]$/', 'a::a-a'));

Expected result:
----------------
int(1)
int(1)

Actual result:
--------------
int(0)
int(1)

 [2010-11-14 22:19 UTC] felipe@php.net
-Status: Open +Status: Bogus
 [2010-11-14 22:19 UTC] felipe@php.net
This is a behavior of PCRE library.

The PCRE man page says:

       Like  recursive  subpatterns, a subroutine call is always treated as an
       atomic group. That is, once it has matched some of the subject  string,
       it  is  never  re-entered, even if it contains untried alternatives and
       there is a subsequent matching failure. Any capturing parentheses  that
       are  set  during  the  subroutine  call revert to their previous values
       afterwards.
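
The atomic behaviour described in this excerpt can be seen in isolation with an explicit atomic group, (?>...). The following is a minimal sketch: the simplified patterns and the variable names $atomic and $plain are illustrative assumptions and do not come from the report.

```php
<?php
// Explicit atomic group: once (?>...) has matched "a:a", the engine cannot
// go back into it to give up ":a", so the trailing ":[a-z]" has nothing left
// to match and the whole match fails.
$atomic = preg_match('/^(?>[a-z](?::[a-z])*):[a-z]$/', 'a:a');

// Ordinary non-capturing group: the greedy * can give back ":a",
// which lets the trailing ":[a-z]" match.
$plain = preg_match('/^(?:[a-z](?::[a-z])*):[a-z]$/', 'a:a');

var_dump($atomic); // int(0)
var_dump($plain);  // int(1)
```

The only difference between the two patterns is (?> versus (?:, which is the same difference felipe describes between calling (?1) and repeating the group's pattern inline.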
 [2010-11-15 01:43 UTC] michael at squiloople dot com
I don't understand. The only difference between the two cases is that one has a 
colon after the backreference and the other has a dash. Look at it like this:

GROUP 1 :: COPY OF GROUP ONE :
GROUP 1 :: COPY OF GROUP ONE -

Why would the first fail but the second not? They should both work.
 [2010-11-15 01:55 UTC] felipe@php.net
You're comparing the wrong regexes.

The equivalent dash version would be:
/^(([a-z])(?:-(?2))*)::(?:(?1)-)[a-z]$/

You can fix this either by not calling the subpattern (?1) and repeating its pattern inline instead, or by making the quantifier * ungreedy. Both avoid the problem caused by the atomic matching.

e.g.
/^(([a-z])(?::(?2))*)::(?:([a-z])(?::(?2))*:)[a-z]$/
/^(([a-z])(?::(?2))*?)::(?:(?1):)[a-z]$/
/^(([a-z])(?::(?2))*)::(?:(?1):)[a-z]$/U

When you call (?1), PCRE internally does: (?>([a-z])(?::(?2))*)
which matches atomically, i.e. no backtracking happens, and backtracking is what is needed to match your "a::a:a" string.
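
The three workarounds above can be checked against the subject from the report with a short script. This is a sketch only: the patterns are quoted from this comment, while the array labels and variable names are hypothetical.

```php
<?php
$subject = 'a::a:a';

// Workaround patterns quoted from the comment above.
$workarounds = [
    'inline' => '/^(([a-z])(?::(?2))*)::(?:([a-z])(?::(?2))*:)[a-z]$/',  // pattern repeated instead of (?1)
    'lazy'   => '/^(([a-z])(?::(?2))*?)::(?:(?1):)[a-z]$/',              // *? instead of *
    'U flag' => '/^(([a-z])(?::(?2))*)::(?:(?1):)[a-z]$/U',              // U inverts greediness
];

$results = [];
foreach ($workarounds as $label => $pattern) {
    // preg_match() returns 1 on a match and 0 on no match.
    $results[$label] = preg_match($pattern, $subject);
    printf("%-7s %d\n", $label, $results[$label]);
}
```

Each of the three patterns should report 1 for this subject. The original pattern from the report is deliberately left out, because its result depends on the PCRE version: the PCRE2 documentation states that from release 10.30 subroutine calls are no longer treated as atomic groups. Note also that the two ungreedy variants only change which match (?1) tries first; on a PCRE version where the call is atomic, they may still behave differently from the inline variant on longer subjects.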
 [2010-11-15 02:02 UTC] felipe@php.net
-Package: Regexps related +Package: PCRE related
 