Bug #49333 Bug in recursive regex processing
Submitted: 2009-08-23 08:10 UTC Modified: 2009-11-23 17:49 UTC
From: laszlo dot janszky at gmail dot com Assigned:
Status: Not a bug Package: PCRE related
PHP Version: 5.2.10 OS: Windows XP
Private report: No CVE-ID: None

 [2009-08-23 08:10 UTC] laszlo dot janszky at gmail dot com
Description:
------------
I developed a recursive regex pattern for compiling template patterns. During testing I found this bug and managed to narrow it down to the following piece of code.
The number of digits matters, and every character counts (including \n). If the string in the 'y' block is 11 or more characters long, the match fails; with 10 or fewer characters it works fine.
I hope it's a real bug and not a fault in my computer. :-)

Reproduce code:
---------------
$pattern='%.*?(?:([a-z])(?:.*?(?:(?R).*?)*?\1)?|$)%sD';
$test='
x
0123456789
x
y
01234567890
y';
preg_match_all($pattern,$test,$matches,PREG_SET_ORDER);
var_dump($matches);

Expected result:
----------------
array(3) { [0]=>  array(2) { [0]=>  string(18) " x 0123456789 x" [1]=>  string(1) "x" } [1]=>  array(2) { [0]=>  string(19) " y 01234567890 y" [1]=>  string(1) "y" } [2]=>  array(1) { [0]=>  string(0) "" } } 

Actual result:
--------------
array(0) { } 

 [2009-08-23 10:54 UTC] sjoerd@php.net
Could not reproduce. When I run the code example you supplied, I get the expected result. Are you sure you have submitted the right code example?
 [2009-08-24 11:20 UTC] inf3rno dot hu at gmail dot com
Yes, I can reproduce it.
I tried an alternative text editor and got the same result, so I don't think it's a memory or text editor problem. (By the way, I'll test my computer's memory soon.)
I'll try again after a reinstall; maybe some DLL files are damaged.
 [2009-08-24 12:21 UTC] inf3rno dot hu at gmail dot com
I reproduced it on another computer with the latest WAMPServer (Apache 2.2.11, PHP 5.3.0). I copied the code from here.
 [2009-08-25 08:35 UTC] jani@php.net
When the $test contains \r\n instead of \n it fails. 
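
The line-ending dependence fits the submitter's note that every character counts: the same text is 31 bytes with \n and 37 bytes with \r\n. The following is a minimal sketch for checking this locally. The helper name pcre_error_for() is hypothetical, and whether the limit is actually hit depends on the PHP/PCRE version and the configured pcre.backtrack_limit, so no particular output is claimed.

```php
<?php
// Build the report's subject with explicit line endings so that the
// editor or the copy/paste path cannot silently change them.
$lines = array('', 'x', '0123456789', 'x', 'y', '01234567890', 'y');
$lf    = implode("\n", $lines);   // 31 bytes
$crlf  = implode("\r\n", $lines); // 37 bytes (6 extra \r characters)

// Pattern copied verbatim from the report.
$pattern = '%.*?(?:([a-z])(?:.*?(?:(?R).*?)*?\1)?|$)%sD';

// Hypothetical helper (not part of PHP): runs the match and returns
// array(number of match sets, PCRE error code).
function pcre_error_for($pattern, $subject)
{
    $matches = array();
    preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER);
    return array(count($matches), preg_last_error());
}

foreach (array('LF' => $lf, 'CRLF' => $crlf) as $label => $subject) {
    list($count, $error) = pcre_error_for($pattern, $subject);
    printf(
        "%s: %d bytes, %d match sets, backtrack limit hit: %s\n",
        $label,
        strlen($subject),
        $count,
        $error === PREG_BACKTRACK_LIMIT_ERROR ? 'yes' : 'no'
    );
}
```

An empty $matches together with PREG_BACKTRACK_LIMIT_ERROR from preg_last_error() would point at the backtrack limit rather than at a wrong pattern.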
 [2009-08-25 08:47 UTC] j dot boggiano at seld dot be
I am not entirely sure what you are trying to achieve, so maybe I broke some functionality, but this pattern at least gives the expected result with either \r\n or \n (of course, \n has fewer characters):

$pattern='%\s*(?:([a-z])(?:.*?(?:(?R).*?)*?\1)?|$)%sD';

Is that good enough?
 [2009-08-25 10:05 UTC] inf3rno dot hu at gmail dot com
The original pattern was this:
'%(?<string>.*?)(?:{\\s*(?<function>[a-z0-9_]+)(?:\\s*(?:(?<hash>(?:(?:\\s+[a-z0-9_]+\\s*=\\s*)?(?:\\$[a-z0-9_]+(?:->[a-z0-9_]+|\\.[a-z0-9_]+)*|\\d+(?:\\.\\d+)?|".*?(?:\\\\".*?)*"))+)|(?<chain>(?:(?:\\s+[a-z0-9_]+(?: [a-z0-9_]+)*\\s+)?(?:\\$[a-z0-9_]+(?:->[a-z0-9_]+|\\.[a-z0-9_]+)*|\\d+(?:\\.\\d+)?|".*?(?:\\\\".*?)*"))+)|(?<list>(?:\\$[a-z0-9_]+(?:->[a-z0-9_]+|\\.[a-z0-9_]+)*|\\d+(?:\\.\\d+)?|".*?(?:\\\\".*?)*")(?:\\s*,\\s*(?:\\$[a-z0-9_]+(?:->[a-z0-9_]+|\\.[a-z0-9_]+)*|\\d+(?:\\.\\d+)?|".*?(?:\\\\".*?)*"))*)))?(?:\\s*}(?<block>.*?(?:(?R).*?)*?){\\s*/(?P=function))?\\s*}|{\\s*\\$(?<variable>[a-z0-9_]+(?:->[a-z0-9_]+|\\.[a-z0-9_]+)*)\\s*}|{\\s*\\*(?<comment>.*?)\\*\\s*}|$)%sDu'

This pattern matches tokens similar to the ones Smarty uses.

I need the %string_before(?:function_with_recursive_block|variable|comment|$)% structure because I also have to capture the string before each token, and this is the fastest way to do that.
I could do the same with offset capture and a regex structured as %function_with_recursive_block|variable|comment%, but that is the slower way, because I would have to call strlen and substr in a loop.
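
For comparison, here is a rough sketch of the offset-capture alternative described above. It assumes a heavily simplified, hypothetical token syntax (only {$name} variables, no functions, blocks, or comments), and the function name split_template() is made up for illustration; it is not the code from the original template compiler.

```php
<?php
// Hypothetical helper: split a template into (text before token, token) pairs
// using PREG_OFFSET_CAPTURE instead of a leading .*? in the pattern.
function split_template($template)
{
    // Simplified token pattern (assumption): matches only {$name}.
    $tokenPattern = '%\{\s*\$([a-z0-9_]+)\s*\}%i';

    $matches = array();
    preg_match_all(
        $tokenPattern,
        $template,
        $matches,
        PREG_SET_ORDER | PREG_OFFSET_CAPTURE
    );

    $parts  = array();
    $cursor = 0; // byte offset just past the previous token

    foreach ($matches as $m) {
        list($token, $offset) = $m[0]; // full match text and its byte offset
        $parts[] = array(
            // Cast to string because substr() can return false on older PHP.
            'string'   => (string) substr($template, $cursor, $offset - $cursor),
            'variable' => $m[1][0],
        );
        $cursor = $offset + strlen($token);
    }

    // Trailing text after the last token (may be empty).
    $parts[] = array(
        'string'   => (string) substr($template, $cursor),
        'variable' => null,
    );

    return $parts;
}

print_r(split_template('Hello {$name}, you have {$count} items'));
```

This trades the leading .*? for one substr() and one strlen() call per token, which is the per-iteration cost mentioned above.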

So I need that .*? :-)
But recursive patterns behave strangely.
I thought that '%.*?(?:([a-z])(?:(?R)*?\1)?|$)%sD' should work too, but it didn't. Logically, (?R)*? here means "string + token, ..., string + end of the recursive part", but "$" matches the end of the whole string, not the end of the recursive part. :S
 [2009-08-25 10:25 UTC] jani@php.net
You can abuse things to some extent, but there's always a limit. And this is not a PHP bug anyway (if it's a bug at all) but a PCRE library issue.
 [2009-11-23 17:49 UTC] laszlo dot janszky at gmail dot com
This bug is related to the memory leak I found:
http://bugs.php.net/bug.php?id=50264

The code works with a raised pcre.backtrack_limit.
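
A minimal sketch of that workaround follows, assuming the limit is raised at runtime with ini_set() and restored afterwards. The helper name match_all_with_limit() and the value 10000000 are illustrative assumptions; the report does not say which value was used.

```php
<?php
// Hypothetical helper: run preg_match_all() under a temporary
// pcre.backtrack_limit and report whether PCRE finished without error.
function match_all_with_limit($pattern, $subject, $limit)
{
    $previous = ini_get('pcre.backtrack_limit');
    ini_set('pcre.backtrack_limit', (string) $limit);

    $matches = array();
    $result  = preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER);
    $error   = preg_last_error();

    // Restore the previous setting so other code is not affected.
    ini_set('pcre.backtrack_limit', $previous);

    return array(
        'matches' => $matches,
        'error'   => $error,
        'ok'      => $result !== false && $error === PREG_NO_ERROR,
    );
}

// Pattern from the report, subject with \r\n line endings.
$pattern = '%.*?(?:([a-z])(?:.*?(?:(?R).*?)*?\1)?|$)%sD';
$subject = implode("\r\n", array('', 'x', '0123456789', 'x', 'y', '01234567890', 'y'));

// 10000000 is an arbitrary illustrative value, not taken from the report.
$run = match_all_with_limit($pattern, $subject, 10000000);
printf("ok: %s, match sets: %d\n", $run['ok'] ? 'yes' : 'no', count($run['matches']));
```

Raising the limit only lets the engine backtrack longer; a pattern that backtracks less is usually the more durable fix.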
 