Bug #14893 RE starting with (.*) might break
Submitted: 2002-01-06 17:57 UTC Modified: 2002-01-27 04:42 UTC
From: japhy at pobox dot com Assigned:
Status: Not a bug Package: PCRE related
PHP Version: 4.1.1 OS: SunOS
Private report: No CVE-ID: None
 [2002-01-06 17:57 UTC] japhy at pobox dot com
Here's the problem:

<?php echo preg_match('/(.*)\d+\1/', 'ab1b'); ?>

The match fails, but it should succeed.  You can fool the engine into avoiding the bug by putting an empty lookahead in front of the pattern:

<?php echo preg_match('/(?=)(.*)\d+\1/', 'ab1b'); ?>

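The cause is this:  a regex that starts with .* can normally be treated as if it were anchored to the beginning of the string, because a greedy .* that can match anywhere can also match from the start.  However, this optimization breaks the match when the .* is captured (as above) and referred to later through a back-reference (\1), since a shorter capture that begins further into the string may be the only one that satisfies the back-reference.  I've contacted the author of the PCRE package.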
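As a rough illustration of the description above, the sketch below compares the reported pattern with a variant that carries an explicit ^ anchor. The anchored variant stands in for what the optimization effectively does; the variable names are only for illustration, and the comment on the unanchored result assumes a PCRE build without the faulty optimization.

```php
<?php
$subject = 'ab1b';

// With an explicit ^ the only attempt starts at offset 0. There (.*) can
// capture "ab1b", "ab1", "ab", "a" or "", and none of them lets \d+\1 match.
$anchored = preg_match('/^(.*)\d+\1/', $subject);

// Without the anchor the engine is free to retry at offset 1, where (.*)
// captures "b", \d+ takes "1", and \1 matches the final "b".
$unanchored = preg_match('/(.*)\d+\1/', $subject, $m);

var_dump($anchored);    // int(0)
var_dump($unanchored);  // int(1) when the pattern is not implicitly anchored
```

If the second call returns 0 as well, the engine is behaving as though the pattern had been written with the ^.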

 [2002-01-27 01:03 UTC] sterling@php.net
a) Not a PHP bug (but it's good to be aware of this issue; if you wouldn't mind, please send a mail to sterling@php.net with any updates, etc.)

b) not sure if this is really a bug, the way I read the 1st regex is:

read in ab
put that as \1
after a digit match \1
which is ab
after the digit there is only b

whereas in the second example you make the regex non-greedy, so it matches from the beginning of the string and matches the ab from the lookahead assertion.
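One way to check this reading of the second example is to look at what (?=) and (.*) do on their own. The snippet below is a small sketch with illustrative variable names; it only inspects captures and makes no claim about PCRE internals.

```php
<?php
// (?=) is an empty lookahead: it consumes nothing and succeeds at any position.
$emptyLookahead = preg_match('/(?=)/', 'ab1b');

// A (.*) placed after it still takes as much of the subject as it can.
preg_match('/(?=)(.*)/', 'ab1b', $greedy);

// In the full pattern, group 1 is whatever lets the back-reference succeed.
$result = preg_match('/(?=)(.*)\d+\1/', 'ab1b', $full);

var_dump($emptyLookahead, $greedy[1], $result, $full[1]);
```

Printing $greedy[1] and $full[1] shows which substring the group actually holds in each case.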

I could be wrong, but either way it's not a PHP bug ;)  If you disagree, please follow up at sterling@php.net

regards,
sterling
 [2002-01-27 04:42 UTC] japhy at pobox dot com
The bug is in PCRE, as the category states. I am merely bringing to the attention of the PHP developers that this bug exists in the regex engine PHP employs.  I have contacted the author of PCRE, and he'll fix it in the next release of PCRE.

As for why it is properly a bug:

  "ab1b" =~ /(.*)\d+\1/

should match as follows (assuming absolutely no optimizations are done):

  []     [ab1b]  OPEN 1
  []     [ab1b]  STAR ANY
  [ab1b] []      CLOSE 1
  [ab1b] []      PLUS DIGIT
                   fail
  [ab1]  [b]     CLOSE 1
  [ab1]  [b]     PLUS DIGIT
                   fail
  [ab]   [1b]    CLOSE 1
  [ab]   [1b]    PLUS DIGIT
  [ab1]  [b]     REF 1
                   fail
  [a]    [b1b]   CLOSE 1
  [a]    [b1b]   PLUS DIGIT
                   fail
                   start over
  [a]    [b1b]   OPEN 1
  [a]    [b1b]   STAR ANY
  [ab1b] []      CLOSE 1
  [ab1b] []      PLUS DIGIT
                   fail
  [ab1]  [b]     CLOSE 1
  [ab1]  [b]     PLUS DIGIT
                   fail
  [ab]   [1b]    CLOSE 1
  [ab]   [1b]    PLUS DIGIT
  [ab1]  [b]     REF 1
  [ab1b] []      DONE

You can see that this regex should succeed (at least, I hope I've made that clear).  The problem is that the PCRE engine optimizes a .* at the beginning of a regex to be implicitly anchored with ^, since it seems obvious that if .* is going to match anywhere, it will end up matching at the beginning of the string.  This is perfectly sensible except in the case where that .* is captured and used later in the regex, as my case shows.
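The trace above can be approximated with a hand-written backtracking matcher. This is only a sketch for the single pattern /(.*)\d+\1/: naiveMatch() is a hypothetical helper, not part of PHP or PCRE, and it assumes the subject contains no newlines (which . would not match by default).

```php
<?php
// Hypothetical helper: unoptimized matcher for /(.*)\d+\1/ only.
// Returns [whole match, group 1] or null when nothing matches.
function naiveMatch(string $s): ?array
{
    $n = strlen($s);
    // "start over": try each starting offset in turn, no implicit anchor.
    for ($start = 0; $start <= $n; $start++) {
        // STAR ANY is greedy: longest capture first, then back off.
        for ($len = $n - $start; $len >= 0; $len--) {
            $group = substr($s, $start, $len);   // CLOSE 1
            $pos = $start + $len;
            // PLUS DIGIT: number of digits available at $pos.
            $digits = strspn($s, '0123456789', $pos);
            // \d+ is greedy as well, and needs at least one digit.
            for ($d = $digits; $d >= 1; $d--) {
                $after = $pos + $d;
                // REF 1: the text after the digits must repeat the capture.
                if (substr($s, $after, strlen($group)) === $group) {
                    $end = $after + strlen($group);
                    return [substr($s, $start, $end - $start), $group];
                }
            }
        }
    }
    return null;
}

var_dump(naiveMatch('ab1b'));
```

For 'ab1b' every capture tried at offset 0 fails, and the retry at offset 1 succeeds with group 1 holding "b", which is the path the trace ends on.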
 