Request #51531 Adding additional backreferencing indicators for use with PREG_OFFSET_CAPTURE
Submitted: 2010-04-11 03:43 UTC Modified: 2016-08-20 10:27 UTC
From: mrjminer at gmail dot com Assigned:
Status: Open Package: PCRE related
PHP Version: Irrelevant OS: All (AFAIK)
Private report: No CVE-ID: None

 [2010-04-11 03:43 UTC] mrjminer at gmail dot com
Description:
------------
This suggestion concerns preg_match_all() when it is called with the PREG_OFFSET_CAPTURE flag.

When PREG_OFFSET_CAPTURE is specified, every matched subpattern is returned in the $matches array as a pair: the matched string and the offset at which it was found.  Yet there are cases where I need only one of these two values for one subpattern, but want the other value (or both) for a different subpattern in the same expression.  In those cases, memory is wasted storing information in $matches that is never used.

My suggestion is to add two additional indicators for backreference capturing that can be used when the PREG_OFFSET_CAPTURE flag is specified.  These indicators would tell the engine to set either the offset or the matched string for that subpattern to null in the $matches array.  I believe this change would reduce the space required to hold the information in $matches, while extending what preg_match_all() can do when used with PREG_OFFSET_CAPTURE (the same could also be done for preg_split() with PREG_SPLIT_OFFSET_CAPTURE).
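
For reference, here is a minimal sketch of the current behaviour the request is about. The sample string in $bbc is a made-up input, not taken from any real application; the point is only that every group, wanted or not, comes back as a [string, offset] pair.

```php
<?php
// Hypothetical sample input, used only for illustration.
$bbc = '[b]bold[/b] [url=http://example.com]link[/url]';

preg_match_all(
    '/\\[(B|I|U|URL|COLOR|SIZE|LIST)(?:=([^]]*?))?]/iu',
    $bbc,
    $openers,
    PREG_SET_ORDER | PREG_OFFSET_CAPTURE
);

// With PREG_OFFSET_CAPTURE each captured group is a two-element array:
// index 0 holds the matched string, index 1 holds its byte offset.
foreach ($openers as $set) {
    foreach ($set as $group => [$text, $offset]) {
        printf("group %d: text=%s offset=%d\n", $group, var_export($text, true), $offset);
    }
}
```

Both halves of each pair are always populated today; there is no way to ask for only one of them per group.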

Test script:
---------------
Take, for instance, the following preg_match_all expressions to match opening tags of BBCode:

1.
preg_match_all('/\\[(B|I|U|URL|COLOR|SIZE|LIST)(?:=([^]]*?))?](?=\\s*?[^\\s])/iu',$bbc,$openers,PREG_SET_ORDER|PREG_OFFSET_CAPTURE);
foreach($openers as $key => $val) {
	foreach($val as $key2 => $val2) {
		foreach($val2 as $key3 => $val3) {
			echo '$openers['.$key.']['.$key2.']['.$key3.'] = '.$val3.'<br>';
		}
	}
}

2.
preg_match_all('/\\[(B|I|U|URL|COLOR|SIZE|LIST)(?:=([^]]*?))?](?=(\\s*?[^\\s]))/iu',$bbc,$openers,PREG_SET_ORDER|PREG_OFFSET_CAPTURE);
foreach($openers as $key => $val) {
	foreach($val as $key2 => $val2) {
		foreach($val2 as $key3 => $val3) {
			echo '$openers['.$key.']['.$key2.']['.$key3.'] = '.$val3.'<br>';
		}
	}
}

Expected result:
----------------
In expression 1, the lookahead '(?=\\s*?[^\\s])' is used only to check the basic validity of an opening tag.  The position where the tag's content begins then has to be computed from the offset of the whole match ($matches[#][0][1]) plus the length of the whole match ($matches[#][0][0]):  $matches[#][0][1] + strlen($matches[#][0][0]) = $contentstartposition.

In expression 2, the lookahead '(?=(\\s*?[^\\s]))' is used to check the basic validity of an opening tag AND to capture the position where the content starts, which avoids the addition and the strlen() call otherwise needed to find that position:  $matches[#][3][1] = $contentstartposition.

In terms of processing power involved, expression 2 is superior to expression 1, as it is merely relaying information already gathered and known by the engine instead of performing addition and a strlen().  However, in terms of the resources required to store the match information, expression 1 is superior to expression 2 and still ensures a valid tag is found (but will require additional processing to get a piece of information returned by expression 2).
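
As a sketch of the two approaches side by side, the snippet below runs both expressions over a made-up $bbc string and records the content start position each way. The variable names ($tag, $first, $second, $computedStarts, $capturedStarts) are illustrative only.

```php
<?php
// Hypothetical sample input, used only for illustration.
$bbc = '[b]bold[/b] [url=http://example.com]  link[/url]';

// Shared opening-tag part of both expressions.
$tag   = '\\[(B|I|U|URL|COLOR|SIZE|LIST)(?:=([^]]*?))?]';
$flags = PREG_SET_ORDER | PREG_OFFSET_CAPTURE;

// Expression 1: the lookahead only validates.
preg_match_all('/' . $tag . '(?=\\s*?[^\\s])/iu', $bbc, $first, $flags);
// Expression 2: the lookahead also captures, adding group 3.
preg_match_all('/' . $tag . '(?=(\\s*?[^\\s]))/iu', $bbc, $second, $flags);

$computedStarts = [];
$capturedStarts = [];
foreach ($first as $i => $set) {
    // Expression 1: offset of the whole match plus its length in bytes.
    $computedStarts[] = $set[0][1] + strlen($set[0][0]);
    // Expression 2: offset already recorded for group 3.
    $capturedStarts[] = $second[$i][3][1];
}

var_dump($computedStarts === $capturedStarts);
```

Offsets reported by PREG_OFFSET_CAPTURE are byte offsets even with the /u modifier, so strlen() is the matching length function here.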

The commonalities among both of these expressions:
-Neither requires the offsets for subpattern [1] or [2], only their contents (for parsing / filtering).  The offsets are returned anyway, at the cost of the memory needed to store them.  The only way to get the contents without that memory cost is to spend significant processing time re-parsing the input for the same text the subpattern match already returns in $matches.
-Neither requires the contents of the last subpattern (captured or not) -- the offset is the only desired portion.  In expression 1, the offset is obtained at the expense of processing resources; in expression 2, it is obtained at the expense of memory resources.

If there were additional indicators to restrict the value returned in $matches for each subpattern, the $matches array could require substantially less memory to store, while retaining its current functionality and making it usable in situations where trading increased memory use for decreased CPU use is not feasible.
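
To make the requested result shape concrete, here is a user-land sketch. prune_matches() and its $keep map are hypothetical names invented for this illustration; nothing like them exists in PHP. Note that it runs after the engine has already built the full array, so it only shows what $matches would look like and does not deliver the memory saving during matching that the request is asking for.

```php
<?php
/**
 * Hypothetical helper: null out the half of each [text, offset] pair
 * that the caller does not need.
 *
 * $keep maps a group index to 'text' or 'offset'; groups not listed
 * keep both values.
 */
function prune_matches(array $sets, array $keep): array
{
    foreach ($sets as &$set) {
        foreach ($set as $group => &$pair) {
            $mode = $keep[$group] ?? 'both';
            if ($mode === 'text') {
                $pair[1] = null;   // drop the offset
            } elseif ($mode === 'offset') {
                $pair[0] = null;   // drop the matched string
            }
        }
        unset($pair);
    }
    unset($set);

    return $sets;
}

// Hypothetical sample input, used only for illustration.
$bbc = '[b]bold[/b]';
preg_match_all(
    '/\\[(B|I|U|URL|COLOR|SIZE|LIST)(?:=([^]]*?))?](?=(\\s*?[^\\s]))/iu',
    $bbc,
    $openers,
    PREG_SET_ORDER | PREG_OFFSET_CAPTURE
);

// Groups 1 and 2: contents only. Group 3: offset only. Group 0: both.
$pruned = prune_matches($openers, [1 => 'text', 2 => 'text', 3 => 'offset']);
var_dump($pruned[0]);
```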

Thanks for your time!



History

 [2010-04-11 03:51 UTC] mrjminer at gmail dot com
By the way, "backreferencing indicator" is not a technical term, as far as I know.  I mean something along the lines of how '?:' at the start of a group indicates that no backreference should be captured for it.

Thanks for reading!
 [2016-08-20 10:27 UTC] cmb@php.net
-Package: *Regular Expressions +Package: PCRE related