Bug #49333 Bug in recursive regex processing
Submitted: 2009-08-23 08:10 UTC Modified: 2009-11-23 17:49 UTC
From: laszlo dot janszky at gmail dot com Assigned:
Status: Not a bug Package: PCRE related
PHP Version: 5.2.10 OS: Windows XP
Private report: No CVE-ID: None
 [2009-08-23 08:10 UTC] laszlo dot janszky at gmail dot com
Description:
------------
I developed a recursive regex pattern for compiling templates. During testing I found this bug, and I managed to reduce it to the following piece of code.
The number of digits matters, and every character counts (\n too). If the string in the 'y' block is 11 or more characters long, the match fails, but with strings of 10 or fewer characters it works fine.
I hope it's a real bug, and not a fault in my computer. :-)

Reproduce code:
---------------
$pattern='%.*?(?:([a-z])(?:.*?(?:(?R).*?)*?\1)?|$)%sD';
$test='
x
0123456789
x
y
01234567890
y';
preg_match_all($pattern,$test,$matches,PREG_SET_ORDER);
var_dump($matches);

Expected result:
----------------
array(3) {
  [0]=>
  array(2) {
    [0]=>
    string(18) " x 0123456789 x"
    [1]=>
    string(1) "x"
  }
  [1]=>
  array(2) {
    [0]=>
    string(19) " y 01234567890 y"
    [1]=>
    string(1) "y"
  }
  [2]=>
  array(1) {
    [0]=>
    string(0) ""
  }
}

Actual result:
--------------
array(0) { } 

 [2009-08-23 10:54 UTC] sjoerd@php.net
Could not reproduce. When I run the code example you supplied, I get the expected result. Are you sure you have submitted the right code example?
 [2009-08-24 11:20 UTC] inf3rno dot hu at gmail dot com
Yes, I can reproduce it.
I tried an alternative text editor, but got the same result, so I don't think it's a memory or text editor problem. (Btw. I'll test my computer's memory soon.)
I'll try again after a reinstall; maybe some DLL files are damaged.
 [2009-08-24 12:21 UTC] inf3rno dot hu at gmail dot com
I reproduced it on another computer with the latest WAMPServer (Apache 2.2.11, PHP 5.3.0). I copied the code from here.
 [2009-08-25 08:35 UTC] jani@php.net
When the $test contains \r\n instead of \n it fails. 
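To check the line-ending observation on another machine, a minimal sketch like the following can help. It builds the reported subject once with \n and once with \r\n and prints what preg_match_all() returns for each. The helper normalize_newlines() is a hypothetical name introduced here for illustration, not something from the report, and the printed match counts are left to the reader because they depend on the PHP/PCRE version and on pcre.backtrack_limit.

```php
<?php
// Pattern copied from the report.
$pattern = '%.*?(?:([a-z])(?:.*?(?:(?R).*?)*?\1)?|$)%sD';

// The reported subject, written with explicit LF line endings.
$lf = "\nx\n0123456789\nx\ny\n01234567890\ny";
// CRLF variant, as a Windows editor would typically save it.
$crlf = str_replace("\n", "\r\n", $lf);

// Hypothetical helper: convert CRLF (and lone CR) to LF before matching.
function normalize_newlines(string $text): string
{
    return str_replace(["\r\n", "\r"], "\n", $text);
}

foreach (['LF' => $lf, 'CRLF' => $crlf] as $label => $subject) {
    // preg_match_all() returns the number of matches, or false on error.
    $count = preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER);
    printf(
        "%s: length=%d, result=%s, preg_last_error=%d\n",
        $label,
        strlen($subject),
        var_export($count, true),
        preg_last_error()
    );
}
```

The CRLF subject is six bytes longer (37 instead of 31), which is consistent with the "every character counts" observation in the original description. Passing the subject through normalize_newlines() first is one way to get the shorter form regardless of how the file was saved.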
 [2009-08-25 08:47 UTC] j dot boggiano at seld dot be
I am not entirely sure what you are trying to achieve, so maybe I broke some functionality, but this pattern at least gives the expected result with both \r\n and \n (of course \n has fewer chars..)

$pattern='%\s*(?:([a-z])(?:.*?(?:(?R).*?)*?\1)?|$)%sD';

Is that good enough?
 [2009-08-25 10:05 UTC] inf3rno dot hu at gmail dot com
Original pattern was this:
'%(?<string>.*?)(?:{\\s*(?<function>[a-z0-9_]+)(?:\\s*(?:(?<hash>(?:(?:\\s+[a-z0-9_]+\\s*=\\s*)?(?:\\$[a-z0-9_]+(?:->[a-z0-9_]+|\\.[a-z0-9_]+)*|\\d+(?:\\.\\d+)?|".*?(?:\\\\".*?)*"))+)|(?<chain>(?:(?:\\s+[a-z0-9_]+(?: [a-z0-9_]+)*\\s+)?(?:\\$[a-z0-9_]+(?:->[a-z0-9_]+|\\.[a-z0-9_]+)*|\\d+(?:\\.\\d+)?|".*?(?:\\\\".*?)*"))+)|(?<list>(?:\\$[a-z0-9_]+(?:->[a-z0-9_]+|\\.[a-z0-9_]+)*|\\d+(?:\\.\\d+)?|".*?(?:\\\\".*?)*")(?:\\s*,\\s*(?:\\$[a-z0-9_]+(?:->[a-z0-9_]+|\\.[a-z0-9_]+)*|\\d+(?:\\.\\d+)?|".*?(?:\\\\".*?)*"))*)))?(?:\\s*}(?<block>.*?(?:(?R).*?)*?){\\s*/(?P=function))?\\s*}|{\\s*\\$(?<variable>[a-z0-9_]+(?:->[a-z0-9_]+|\\.[a-z0-9_]+)*)\\s*}|{\\s*\\*(?<comment>.*?)\\*\\s*}|$)%sDu'

This pattern matches tokens similar to the ones Smarty uses.

I need the %string_before(?:function_with_recursive_block|variable|comment|$)% structure because I also have to capture the string before each token, and this is the fastest way to do that.
I could do the same with offset capture and a regex structured as %function_with_recursive_block|variable|comment%, but that is the slower way, because I have to call strlen and substr in a loop.

So I need that .*? :-)
But recursive patterns behave strangely.
I thought that '%.*?(?:([a-z])(?:(?R)*?\1)?|$)%sD' would work too, but it didn't. Logically, (?R)*? here should mean "string+token...+string+end_of_the_recursive_part", but "$" matches the end of the whole string, not the end of the recursive part. :S
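For reference, the offset-capture alternative described in this comment could look roughly like the sketch below. It is deliberately simplified: TOKEN_PATTERN only recognises {$variable} tokens, and both that constant and split_template() are hypothetical names made up for this illustration, not part of the original template compiler.

```php
<?php
// Hypothetical, much-simplified token pattern: only {$variable} tokens.
const TOKEN_PATTERN = '%\{\s*\$([a-z0-9_]+)\s*\}%i';

// Split a template into [string-before-token, variable-name] pairs,
// using PREG_OFFSET_CAPTURE plus substr()/strlen() instead of a leading .*?
function split_template(string $template): array
{
    $parts = [];
    $pos = 0; // byte offset just past the previous token

    preg_match_all(
        TOKEN_PATTERN,
        $template,
        $matches,
        PREG_SET_ORDER | PREG_OFFSET_CAPTURE
    );

    foreach ($matches as $m) {
        // With PREG_OFFSET_CAPTURE every group is a [text, offset] pair.
        [$token, $offset] = $m[0];
        $parts[] = [
            'string'   => substr($template, $pos, $offset - $pos),
            'variable' => $m[1][0],
        ];
        $pos = $offset + strlen($token);
    }

    // Trailing text after the last token (the "|$" branch of the original pattern).
    $parts[] = ['string' => substr($template, $pos), 'variable' => null];

    return $parts;
}

$parts = split_template('Hello {$name}, you have {$count} messages.');
var_dump($parts);
```

This trades the leading .*? for one substr() and one strlen() call per token, which is the per-iteration cost the comment refers to. Whether it is actually slower in practice would need measuring.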
 [2009-08-25 10:25 UTC] jani@php.net
You can abuse things to some extent, but there's always a limit. And this is not a PHP bug anyway (if it is a bug at all) but a PCRE library one.
 [2009-11-23 17:49 UTC] laszlo dot janszky at gmail dot com
This bug is related to the memory leak I found:
http://bugs.php.net/bug.php?id=50264

The code works with a raised pcre.backtrack_limit.
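A minimal sketch of how one might verify this: run the pattern under a chosen pcre.backtrack_limit and inspect preg_last_error() afterwards. The helper match_all_with_limit() is a hypothetical name introduced for this example. The two limits tried are 100000 (the documented default before PHP 5.3.7) and 1000000. No expected output is given, since the result depends on the PHP and PCRE versions in use.

```php
<?php
// Hypothetical helper: run preg_match_all() under a temporary pcre.backtrack_limit.
function match_all_with_limit(string $pattern, string $subject, int $limit): array
{
    // ini_set() returns the previous value as a string, or false on failure.
    $previous = ini_set('pcre.backtrack_limit', (string) $limit);

    $count = preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER);
    $error = preg_last_error();

    if ($previous !== false) {
        ini_set('pcre.backtrack_limit', $previous); // restore the old setting
    }

    return [
        'count'     => $count,                       // int, or false on error
        'error'     => $error,                       // one of the PREG_*_ERROR constants
        'limit_hit' => $error === PREG_BACKTRACK_LIMIT_ERROR,
        'matches'   => $matches ?? [],
    ];
}

// Pattern and CRLF subject taken from the report.
$pattern = '%.*?(?:([a-z])(?:.*?(?:(?R).*?)*?\1)?|$)%sD';
$subject = "\r\nx\r\n0123456789\r\nx\r\ny\r\n01234567890\r\ny";

foreach ([100000, 1000000] as $limit) {
    $result = match_all_with_limit($pattern, $subject, $limit);
    printf(
        "limit=%d: result=%s, backtrack limit hit=%s\n",
        $limit,
        var_export($result['count'], true),
        var_export($result['limit_hit'], true)
    );
}
```

Checking preg_last_error() matters here because a preg_* call that runs into the limit does not raise a warning; it just returns an empty or false result, which matches the array(0) { } seen in the original report.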