Bug #53309 Capturing group failing with a colon
Submitted: 2010-11-14 16:16 UTC Modified: 2010-11-15 02:02 UTC
From: michael at squiloople dot com Assigned:
Status: Not a bug Package: PCRE related
PHP Version: 5.3.3 OS: Vista
Private report: No CVE-ID: None

 [2010-11-14 16:16 UTC] michael at squiloople dot com
Description:
------------
In some circumstances, a pattern unexpectedly fails to match when a colon is 
one of the literal characters inside a capturing group.

Test script:
---------------
var_dump(preg_match('/^(([a-z])(?::(?2))*)::(?:(?1):)[a-z]$/', 'a::a:a'));
var_dump(preg_match('/^(([a-z])(?::(?2))*)::(?:(?1)-)[a-z]$/', 'a::a-a'));

Expected result:
----------------
int(1)
int(1)

Actual result:
--------------
int(0)
int(1)

 [2010-11-14 22:19 UTC] felipe@php.net
-Status: Open +Status: Bogus
 [2010-11-14 22:19 UTC] felipe@php.net
This is the documented behavior of the PCRE library.

The PCRE man page says:

       Like  recursive  subpatterns, a subroutine call is always treated as an
       atomic group. That is, once it has matched some of the subject  string,
       it  is  never  re-entered, even if it contains untried alternatives and
       there is a subsequent matching failure. Any capturing parentheses  that
       are  set  during  the  subroutine  call revert to their previous values
       afterwards.
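
To make the quoted rule concrete, here is a minimal sketch that uses an explicit atomic group, (?>...), since that construct does not depend on how a particular PCRE version treats subroutine calls. The subject strings and variable names are illustrative assumptions and are not part of the original report.

```php
<?php
// Ordinary group: a+ can give back one "a", so the trailing "a" still matches.
$backtracking = preg_match('/^a+a$/', 'aaa');

// Atomic group: a+ keeps everything it matched, so the trailing "a" finds nothing.
$atomic = preg_match('/^(?>a+)a$/', 'aaa');

// Captures set during a subroutine call revert afterwards:
// (?1) consumes "b", but group 1 still holds the "a" it captured directly.
preg_match('/^([a-z])(?1)$/', 'ab', $groups);

var_dump($backtracking, $atomic, $groups[1]);
```

The first two calls differ only in the (?>...) wrapper, which is the behavior the man page attributes to subroutine calls.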
 [2010-11-15 01:43 UTC] michael at squiloople dot com
I don't understand. The only difference between the two cases is that one has a 
colon after the (?1) subpattern call and the other has a dash. Look at it like this:

GROUP 1 :: COPY OF GROUP ONE :
GROUP 1 :: COPY OF GROUP ONE -

Why would the first fail but the second not? They should both work.
 [2010-11-15 01:55 UTC] felipe@php.net
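You're not comparing equivalent regexes.

For a like-for-like comparison, the separator inside group 1 must be the same character that follows (?1), so the dash version should be:
/^(([a-z])(?:-(?2))*)::(?:(?1)-)[a-z]$/

You can fix this either by not calling the subpattern with (?1) and repeating its pattern instead, or by making the * quantifier ungreedy. Either approach sidesteps the problem caused by atomic matching.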

e.g.
/^(([a-z])(?::(?2))*)::(?:([a-z])(?::(?2))*:)[a-z]$/
/^(([a-z])(?::(?2))*?)::(?:(?1):)[a-z]$/
/^(([a-z])(?::(?2))*)::(?:(?1):)[a-z]$/U
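
As a quick check, the three patterns above can be run against the subject string from the report. This is only a sketch; the array keys and variable names are labels made up for illustration.

```php
<?php
$subject = 'a::a:a';

// The three workarounds, keyed by a descriptive (hypothetical) label.
$workarounds = [
    'repeated pattern' => '/^(([a-z])(?::(?2))*)::(?:([a-z])(?::(?2))*:)[a-z]$/',
    'lazy quantifier'  => '/^(([a-z])(?::(?2))*?)::(?:(?1):)[a-z]$/',
    'U modifier'       => '/^(([a-z])(?::(?2))*)::(?:(?1):)[a-z]$/U',
];

$results = [];
foreach ($workarounds as $label => $pattern) {
    // preg_match() returns 1 on a match, 0 on no match.
    $results[$label] = preg_match($pattern, $subject);
    printf("%-16s => %d\n", $label, $results[$label]);
}
```

Each variant should report 1: the repeated pattern is free to backtrack, and the lazy variants make the called group stop after the first letter, so nothing needs to be given back.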

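When you write (?1), PCRE internally treats it as (?>([a-z])(?::(?2))*), an atomic group. The call greedily consumes "a:a" and, being atomic, cannot give the ":a" back, so the following : has nothing left to match. That backtracking is exactly what would be needed to match your "a::a:a" string.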
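
The effect can be reproduced without relying on (?1) by writing the group body out by hand, once as an ordinary group and once inside an explicit atomic group. This is a sketch under that assumption: the inlined patterns and variable names are illustrative, not taken from PCRE's internals. The original (?1) pattern is included for comparison without an expected value, because its result depends on the PCRE version PHP is built against (newer PCRE2 releases are documented as allowing backtracking into subroutine calls).

```php
<?php
$subject = 'a::a:a';

// Group 1's body inlined as an ordinary group that can backtrack.
$plain = '/^(([a-z])(?::(?2))*)::(?:(?:[a-z](?::[a-z])*):)[a-z]$/';

// The same body wrapped in an explicit atomic group (?>...).
$atomic = '/^(([a-z])(?::(?2))*)::(?:(?>[a-z](?::[a-z])*):)[a-z]$/';

// The pattern from the report, using the subroutine call (?1).
$call = '/^(([a-z])(?::(?2))*)::(?:(?1):)[a-z]$/';

$plainResult  = preg_match($plain, $subject);  // 1: ":a" is given back
$atomicResult = preg_match($atomic, $subject); // 0: "a:a" is never given back
$callResult   = preg_match($call, $subject);   // depends on the PCRE version

var_dump($plainResult, $atomicResult, $callResult);
```

If $callResult agrees with $atomicResult, the installed PCRE treats subroutine calls atomically, as described in this report.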
 [2010-11-15 02:02 UTC] felipe@php.net
-Package: Regexps related +Package: PCRE related
 