Bug #30618 INDEX POSITIONS OF A REGULAR EXPRESSION
Submitted: 2004-10-29 23:39 UTC Modified: 2004-10-30 16:48 UTC
From: webmaster at unitedscripters dot com Assigned:
Status: Not a bug Package: Regexps related
PHP Version: 5.0.2 OS: Windows XPP
Private report: No CVE-ID: None
 [2004-10-29 23:39 UTC] webmaster at unitedscripters dot com
Description:
------------
Object: FINDING INDEX POSITIONS OF A REGULAR EXPRESSION MATCH IS APPARENTLY A NON-AVAILABLE FEATURE

I might be wrong but apparently PHP lacks a way to spot not only matches but their _index_ positions within a string.

At first I thought that, once preg_match_all had found the matches, all one had to do to get their index positions in the input string was to iterate over the returned array of matches and look up each match in the string with strpos, cutting off the substring already inspected.

Although this seems an obvious idea, it does not always work.

The position that a string-oriented function finds for a piece of text is not necessarily the position at which a regular-expression function matched it.

Consider this example, input string is:
"A thesaurus for the pupil"
whereas the regular expression searches for:
"/the\\b/"
which is obviously the word "the" followed by a word boundary (\\b).

The preg_match_all matches would report, correctly, only the isolated article "the", for that is followed by a word boundary.
But attempting to retrieve the index position of that match by strpos would report the index position of THEsaurus.

So do _not_ combine strpos with preg_match_all to retrieve the index positions of the matches: it will not work as expected.

Reproduce code:
---------------
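function foo($string, $regexp){
    $found = 0;          // offset from which the next lookup starts
    $indexes = array();
    preg_match_all($regexp, $string, $matches);
    if (empty($matches[0])) {
        return $indexes; // no match, nothing to locate
    }
    print("<strong>".$matches[0][0]."</strong>");
    $matchSize = sizeof($matches[0]);
    for ($m = 0; $m < $matchSize; $m++) {
        // Re-run the regexp from $found; only the matched text ([0][0]) is used below.
        preg_match($regexp, $string, $specificMatch, PREG_OFFSET_CAPTURE, $found);
        // Locate that text with strpos(). Shortcoming: it's not a real index,
        // because strpos() knows nothing about the \b in the regexp.
        $indexes[$m] = $found + strpos(substr($string, $found), $specificMatch[0][0]);
        // Continue after the text just located.
        $found = $indexes[$m] + strlen($matches[0][$m]);
    }
    return $indexes;
}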

$in="A thesaurus for the pupil";
print "In string <strong>$in</strong>, match is: ";
$out=foo($in, "/the\\b/");
print "<br>Wrong Index reported: ";
print_r($out);

Expected result:
----------------
The result itself is correct. What is missing is a feature that _apparently_ we cannot even implement ourselves: getting the correct index of a regular expression match.
Whatever the case, the feature is needed. JavaScript has it: the regular-expression string method search() reports the index of the first match, so it can be applied repeatedly to gradually shrinking substrings of the input string to retrieve the positions of all the matches.

If there is a way and I was not aware of it, I apologize. Yet the list of Perl-compatible regexp functions appears to lack one for retrieving the indexes.


 [2004-10-30 14:27 UTC] webmaster at unitedscripters dot com
Setting the PREG_OFFSET_CAPTURE flag in preg_match_all does exactly that: each match is returned together with its offset in the input string.
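
For readers who land on this report, here is a minimal sketch of that approach applied to the example string from the description. The helper name matchOffsets() is made up for illustration and is not part of PHP. Note that the offsets are byte offsets, which only coincide with character positions for single-byte text.

```php
<?php
// Hypothetical helper: return the byte offset of every match of $pattern in $subject.
function matchOffsets($pattern, $subject)
{
    $offsets = array();
    // With PREG_OFFSET_CAPTURE, each entry of $matches[0] is array(matchedText, byteOffset).
    if (preg_match_all($pattern, $subject, $matches, PREG_OFFSET_CAPTURE)) {
        foreach ($matches[0] as $match) {
            $offsets[] = $match[1];
        }
    }
    return $offsets;
}

$in = "A thesaurus for the pupil";
print_r(matchOffsets('/the\b/', $in)); // the standalone "the" starts at offset 16
var_dump(strpos($in, "the"));          // int(2): the "the" inside "thesaurus"
```

The regexp offset (16) and the strpos result (2) differ for exactly the reason given in the description: strpos knows nothing about the word boundary.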

I apologize for the wrong submission of an alleged missing feature.

Luckily enough, this is the only wrong submission I have sent. I'll be more careful in the future when I report something at around 5am Italian time after 15 hours of coding!
 [2004-10-30 16:48 UTC] derick@php.net
User error, so we mark the bug as bogus.
 