Bug #14893 RE starting with (.*) might break
Submitted: 2002-01-06 17:57 UTC Modified: 2002-01-27 04:42 UTC
Votes:1
Avg. Score:2.0 ± 0.0
Reproduced:1 of 1 (100.0%)
Same Version:0 (0.0%)
Same OS:0 (0.0%)
From: japhy at pobox dot com Assigned:
Status: Not a bug Package: PCRE related
PHP Version: 4.1.1 OS: SunOS
Private report: No CVE-ID: None
 [2002-01-06 17:57 UTC] japhy at pobox dot com
Here's the problem:

<?php echo preg_match('/(.*)\d+\1/', 'ab1b'); ?>

It fails, but it really shouldn't.  You can fool the engine into not having the bug:

<?php echo preg_match('/(?=)(.*)\d+\1/', 'ab1b'); ?>

The bug is this: a regex that starts with .* can logically be treated as if it began with an implicit anchor to the beginning of the string. However, this optimization can make a regex fail when it should succeed if the .* is captured (as above) and used later (the back-reference \1). I've contacted the author of the PCRE package.
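A minimal sketch for checking which behavior a given PHP build shows. It runs both patterns from the report against the same subject and records the return value and captures. The variable names ($subject, $patterns, $results) are illustrative only. On a PCRE version where this optimization is no longer applied to patterns with such a back-reference, both calls should return 1 with \1 set to "b"; on the affected version described here, the first call would be expected to return 0.

```php
<?php
// Subject and patterns taken from the report above.
$subject  = 'ab1b';
$patterns = [
    'plain'      => '/(.*)\d+\1/',
    'empty (?=)' => '/(?=)(.*)\d+\1/',
];

$results = [];
foreach ($patterns as $label => $pattern) {
    $matches = [];
    // preg_match returns 1 on a match, 0 on no match, false on error.
    $rc = preg_match($pattern, $subject, $matches);
    $results[$label] = ['rc' => $rc, 'matches' => $matches];

    printf(
        "%-10s rc=%s whole=%s group1=%s\n",
        $label,
        var_export($rc, true),
        $matches[0] ?? '(none)',
        $matches[1] ?? '(none)'
    );
}
```

If the two lines of output differ, the engine in use is treating the leading .* as anchored even though it is captured and referenced later.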

 [2002-01-27 01:03 UTC] sterling@php.net
a) Not a PHP bug (but it's good to be aware of this issue; if you wouldn't mind, please send a mail to sterling@php.net with any updates, etc.)

b) not sure if this is really a bug, the way I read the 1st regex is:

read in ab
put that as \1
after a digit match \1
which is ab
after the digit there is only b

whereas in the second example you make the regex non-greedy, so it matches from the beginning of the string and matches the ab from the lookahead assertion.

I could be wrong, but either way it's not a PHP bug ;)  If you disagree, please follow up at sterling@php.net

regards,
sterling
 [2002-01-27 04:42 UTC] japhy at pobox dot com
The bug is in PCRE, as the category states -- I am merely bringing to the attention of the PHP developers that this bug exists in the regex engine it employs.  I have contacted the author of PCRE, and he'll fix it when the next version of PCRE is released.

As for why it is properly a bug:

  "ab1b" =~ /(.*)\d+\1/

should match as follows (assuming absolutely no optimizations are done):

  []     [ab1b]  OPEN 1
  []     [ab1b]  STAR ANY
  [ab1b] []      CLOSE 1
  [ab1b] []      PLUS DIGIT
                   fail
  [ab1]  [b]     CLOSE 1
  [ab1]  [b]     PLUS DIGIT
                   fail
  [ab]   [1b]    CLOSE 1
  [ab]   [1b]    PLUS DIGIT
  [ab1]  [b]     REF 1
                   fail
  [a]    [b1b]   CLOSE 1
  [a]    [b1b]   PLUS DIGIT
                   fail
                   start over
  [a]    [b1b]   OPEN 1
  [a]    [b1b]   STAR ANY
  [ab1b] []      CLOSE 1
  [ab1b] []      PLUS DIGIT
                   fail
  [ab1]  [b]     CLOSE 1
  [ab1]  [b]     PLUS DIGIT
                   fail
  [ab]   [1b]    CLOSE 1
  [ab]   [1b]    PLUS DIGIT
  [ab1]  [b]     REF 1
  [ab1b] []      DONE

You can see that this regex should succeed (at least, I hope I've made that clear).  The problem is that the PCRE engine optimizes a .* at the beginning of a regex to be implicitly anchored with ^, since it seems obvious that if .* is going to match anywhere, it will end up matching at the beginning of the string.  This is perfectly sensible except in the case where that .* is captured and used later in the regex, as my case shows.
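The trace above can be sketched as a brute-force matcher, to make the effect of the implicit anchor concrete. This is only an illustration for the single pattern /(.*)\d+\1/, not how PCRE is implemented. The function name naive_match and its $anchored flag are hypothetical; $anchored = true models the optimization by trying start offset 0 only. It assumes the subject contains no newlines, so that . can match any character.

```php
<?php
// Hypothetical brute-force matcher for the one pattern /(.*)\d+\1/.
// Returns start offset, capture and whole match, or null if nothing matches.
function naive_match(string $s, bool $anchored = false): ?array
{
    $n = strlen($s);
    // With the implicit ^, only offset 0 is ever tried.
    $lastStart = $anchored ? 0 : $n;

    for ($start = 0; $start <= $lastStart; $start++) {
        // Greedy (.*): longest capture first, then backtrack one char at a time.
        for ($len = $n - $start; $len >= 0; $len--) {
            $cap = substr($s, $start, $len);
            $pos = $start + $len;
            // \d+: length of the digit run at $pos, longest first.
            $run = strspn($s, '0123456789', $pos);
            for ($d = $run; $d >= 1; $d--) {
                $refPos = $pos + $d;
                // \1: the captured text must appear again right here.
                if (substr($s, $refPos, strlen($cap)) === $cap) {
                    $end = $refPos + strlen($cap);
                    return [
                        'start'   => $start,
                        'capture' => $cap,
                        'match'   => substr($s, $start, $end - $start),
                    ];
                }
            }
        }
    }
    return null;
}

var_dump(naive_match('ab1b'));        // unanchored: tries offset 0, then 1
var_dump(naive_match('ab1b', true));  // anchored: gives up after offset 0
```

Without the anchor, the search fails at offset 0 and succeeds at offset 1 with the capture "b"; with the anchor, the second attempt never happens, which is the failure reported here.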
 