Bug #33334 Matching explicitly excluded characters
Submitted: 2005-06-14 09:14 UTC Modified: 2005-06-16 13:07 UTC
From: kloske at tpg dot com dot au Assigned:
Status: Not a bug Package: PCRE related
PHP Version: 4.3.10 OS: Linux
Private report: No CVE-ID: None
 [2005-06-14 09:14 UTC] kloske at tpg dot com dot au
Description:
------------
Whilst trying to get a > 600 character regular expression to correctly match input lines from a file I discovered some strange mismatching which at first I imagined was a bug in my regex string until I reduced it to the simple test case included below.

The test case shows a regex which should match lines that contain two fields, separated by a comma. Each field is identical and can either be a string that does not start with a quote or a comma and contains no commas in it OR starts with a quote and ends with a quote and must contain only quotes or backslashes escaped with a preceding backslash. I.e.: two fields, which may only be simple strings or C-style escaped strings, separated by a comma.

Note in my expected output I am making an educated guess as to what the actual output would be, some of the other fields printed might be a little different. The basics of the problem however are clearly demonstrated.

The final thing to note is that if I exclude quotes from the middle or end of the unquoted string case the problem vanishes. This leads me to suspect the problem is somehow related to regex's handling of quotes.

Even if there are problems with my regex (I am well aware it is not optimal or particularly "good" in any sense - be aware this is a cut down test case only) this example clearly demonstrates php's regex engine matching a string which contains characters which are clearly excluded in the pattern which it matches.

I've tested this with one field and it doesn't appear to be a problem there - it seems to only affect two fields one after another.

Reproduce code:
---------------
<?php

	$s = '"some text","test \",thing"';

	$r_text = "(\"(([^\\\"]|\\\\|\\\")*)\"|[^\",][^,]*)";
	
	$r_twofields = "${r_text},${r_text}";
	preg_match("/^${r_twofields}\$/", $s, $line);
	
	echo "<pre>";
	echo $s . "\n";
	echo $r_twofields . "\n";
	var_dump($line);
	echo "</pre>";

?>

Expected result:
----------------
"some text","test \",thing"
("(([^\"]|\\|\")*)"|[^",][^,]*),("(([^\"]|\\|\")*)"|[^",][^,]*)
array(5) {
  [0]=>
  string(27) ""some text","test \",thing""
  [1]=>
  string(20) ""some text","test \""
  [2]=>
  string(18) "some text","test \"
  [3]=>
  string(1) "\"
  [4]=>
  string(6) "thing""
}


Actual result:
--------------
"some text","test \", thing"
("(([^\"]|\\|\")*)"|[^",][^,]*),("(([^\"]|\\|\")*)"|[^",][^,]*)
array(5) {
  [0]=>
  string(28) ""some text","test \", thing""
  [1]=>
  string(20) ""some text","test \""
  [2]=>
  string(18) "some text","test \"
  [3]=>
  string(1) "\"
  [4]=>
  string(7) " thing""
}


 [2005-06-14 09:17 UTC] kloske at tpg dot com dot au
Note that due to issues with the CAPTCHA, I've somehow included the wrong expected output and actual output.

The ACTUAL output is:
"some text","test \",thing"
("(([^\"]|\\|\")*)"|[^",][^,]*),("(([^\"]|\\|\")*)"|[^",][^,]*)
array(5) {
  [0]=>
  string(27) ""some text","test \",thing""
  [1]=>
  string(20) ""some text","test \""
  [2]=>
  string(18) "some text","test \"
  [3]=>
  string(1) "\"
  [4]=>
  string(6) "thing""
}

And the expected output is:
"some text","test \",thing"
("(([^\"]|\\|\")*)"|[^",][^,]*),("(([^\"]|\\|\")*)"|[^",][^,]*)
array(5) {
  [0]=>
  string(27) ""some text","test \",thing""
  [1]=>
  string(20) ""some text""
  [2]=>
  string(18) "some text"
  [3]=>
  string(1) "t"
  [4]=>
  string(6) ""test \", thing""
  [5]=>
  string(6) "test \", thing"
  [6]=>
  string(1) "g"
}

Sorry for the confusion.
 [2005-06-14 09:31 UTC] kloske at tpg dot com dot au
Hi,

Unfortunately the system this is running on at present is in production and I don't really have the resources just at this stage to get the latest stable snapshot up and running.

Perhaps someone with this stable snapshot can copy and paste the 10 or so short lines into a test.php webpage and see if it runs as expected or not?

If the reason you're asking me to do this is that you've tested it on the latest stable and it works then I will try as soon as I get time to check this, but otherwise I'll have to leave it a while as I have a lot of work on at the moment (buying a house, short staffed at work, serious spinal problems - the usual!)

As a slight aside, I should mention that I just tested it on another PHP box which is totally unrelated to the first, this time OpenBSD, PHP 4.1.2 and it is also affected.

I should have probably prefaced the report with the fact that I've got a workaround for my particular case which is an acceptable solution (just not accept strings which are unquoted and contain quotes).
 [2005-06-14 09:52 UTC] rasmus@php.net
Regular expressions are greedy by default.  Change it to:

$r_text = "(\"(([^\\\"]|\\\\|\\\")*?)\"|[^\",][^,]*?)";

or use the U modifier on the call and I bet it will do what you want.  There is no bug here.  
 [2005-06-14 12:20 UTC] kloske at tpg dot com dot au
Hi, strangely enough, you are correct that placing a question mark (for exactly 0 or 1 matches) works.

*however*, this opens up more questions than it answers (and to my mind brings to light perhaps deeper bugs). The regex manuals all have the following to say:

1. The behavior of multiple adjacent duplication symbols (+, *, ? and intervals) produces undefined results.

2. * matches zero or more occurrences, so ignoring (1), *? taken to mean what is most obvious means "zero or more repeated once or not at all" which definitely logically collapses down to "zero or more" which is what * means on its own, which is (a) what I had, and (b) logically equivalent to the suggested solution.

3. '\' and '"' NEVER (even when greedy) match ([^\"]|\\|\"), which my test case clearly demonstrates the PHP regular expression engine doing.

(1) would tend to suggest that *? as the correct way to achieve what I am after is undefined and therefore not correct.

(2) seems to indicate that failing (1), the two expressions should be equivalent and both produce the same behavior (which they clearly do not)

and

(3) cannot possibly be explained by ANY alternative solution since it clearly violates all possible ways of interpreting the regex.

Put simply: any sequence of characters generated from this regular expression ([^\"]|\\|\") can never contain a single backslash or a quote that is not preceded by a backslash, yet the match that PHP's regular expression engine is returning violates this precondition.

I can see three possible situations occurring here:

1. PHP regex differs from the standard forms of regex available on POSIX systems, and whilst this may be desirable it needs to be clearly documented (which it currently is not - it is not even hinted at).

2. PHP regex has a bug with its handling of zero or more repetition generators.

3. There is something which I still am missing after repeated inspections, readings of the relevant manuals, and consultation with peers.
 [2005-06-14 12:23 UTC] kloske at tpg dot com dot au
I do not believe this bug to be bogus or resolved.
 [2005-06-14 16:46 UTC] sniper@php.net
It really is bogus: PHP uses the PCRE library underneath the preg_* functions. If there is any bug (IMO there is no bug), then it's in PCRE, so report this to the authors of that.

 [2005-06-14 17:35 UTC] rasmus@php.net
I have no idea what manuals you are reading or which peers you are talking to, but in perl-style regular expressions the '?' character is overloaded and has different meanings in different contexts.  Type "man perlre" at your Unix prompt and you will see:

       By default, a quantified subpattern is "greedy", that is, it will match
       as many times as possible (given a particular starting location) while
       still allowing the rest of the pattern to match.  If you want it to
       match the minimum number of times possible, follow the quantifier with
       a "?".

If you still don't understand this, take it up with the developers of the PCRE library over at http://pcre.org since that is the code PHP uses.  Even if somebody here agreed that there is a bug, it would have to be fixed by the PCRE folks.
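
The greedy and minimal behaviour quoted above can be seen on a small input. This is only an illustrative sketch; the subject string and variable names are made up for the example and are not from the report.

```php
<?php
// 'a', then anything, then 'b': there are two places the match could end.
$subject = 'aXbXb';

preg_match('/a.*b/',  $subject, $greedy);    // default: as many as possible
preg_match('/a.*?b/', $subject, $lazy);      // "?" after the quantifier: as few as possible
preg_match('/a.*b/U', $subject, $ungreedy);  // U modifier inverts the default greediness

var_dump($greedy[0]);    // string(5) "aXbXb"
var_dump($lazy[0]);      // string(3) "aXb"
var_dump($ungreedy[0]);  // string(3) "aXb"
```

Note that greediness only changes which of the possible matches is chosen; it does not change which characters a sub-pattern is allowed to match.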
 [2005-06-15 01:18 UTC] kloske at tpg dot com dot au
Thank you for that information - it is much appreciated. I will take this up with the PCRE people, as I still believe this to be incorrect behavior.

FYI, the documentation I was reading was the regex man pages on both solaris and linux. My peers were people who've studied regular expressions (as have I), and agreed that based on the definitions we've all seen in our respective studies (though none of us have studied PCRE specifically as an implementation) that the behavior we saw was a violation of matching conditions, as specified in the test case's regular expression.

ie: based on your greedy quote from the PCRE pages, I do not want it to match a minimum number of times, I want it to match as much as possible. Note the word possible; this regex did not allow it to match as much as it did - IE, it became very greedy indeed, to the point of matching text it wasn't allowed to!
 [2005-06-15 11:22 UTC] kloske at tpg dot com dot au
As a more simple test case, this literal text string:

"test","string\"

matches the following REGEX pattern:

^"([^\"]|\\|\")*"$

Reversing the sense of REGEX to being a pattern GENERATOR, there is no way for that REGEX pattern to generate the string above.

I've reported this to the PCRE people and will keep you all posted as to the reply.
 [2005-06-15 12:03 UTC] kloske at tpg dot com dot au
Okay, the PCRE people have gotten back to me, and PCRE has proven to produce the correct expected behavior and my test case has not failed.

So now we're left with a test case which fails in PHP yet works on PCRE.

For a more stark example, consider the following PHP code:

$r = "/^\"([^\\\"]|\\\\|\\\")*\"\$/";
$s = "\"some text\",\"test \\\"";
preg_match($r, $s, $m);
var_dump($m);

$m should be empty, since $s does not match $r, yet the following is returned:

array(2) { [0]=> string(20) ""some text","test \"" [1]=> string(1) "\" } 

Note that the last element of the array contains a single backslash, indicating that the last choice that matched was a backslash, which is NOT ONE OF THE THREE CHOICES.

So, the PCRE people explained that they were not familiar with PHP but wondered if it is an escaping issue.

Does PHP require you to DOUBLE escape regex? ie, to match a sequence of two backslashes in a row, do you need to write "\\\\\\\\"? I've tried doing this and it seems to give the expected behavior, yet the manual does not mention this fact, and worse the user comments seem to indicate that you should not double escape (since no one is trying to do two backslashes in a row anywhere).

I'd say this is a documentation deficiency more than anything, since it should be made clear that you need to escape the string first, which then will need to be escaped again for correct interpretation by PCRE if you are trying to include a literal backslash, or in other situations where PCRE needs to escape things.

To recap, this is what you apparently need to write in PHP to match a literal of two backslashes next to each other:

"\\\\\\\\"

Gotta love it!

Because:

The number of backslashes is halved when PHP encodes it as a string, then it passes it literally to PCRE, which halves the number of backslashes again, to the final figure of two backslashes!

Simple when you understand, not even hinted at in the PHP documentation.
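
The two rounds of halving can be checked directly. The following is a minimal sketch, not taken from the manual; the names $two, $loose, $strict and $subject are illustrative, and $strict assumes the intended grammar is "any character other than a backslash or quote, or an escaped backslash, or an escaped quote".

```php
<?php
// Eight backslashes in the source become four in the PHP string,
// which PCRE then reads as two literal backslashes.
$two = "\\\\\\\\";
var_dump(strlen($two));                                // int(4)
var_dump(preg_match('/^' . $two . '$/', '\\\\'));      // int(1): subject is two backslashes

// The 20-character subject from the report; it ends in \"
$subject = '"some text","test \\"';

// As written in the report: PCRE receives [^\"], i.e. "anything but a quote",
// so a lone backslash is accepted by the first alternative.
$loose  = "/^\"([^\\\"]|\\\\|\\\")*\"\$/";

// Escaped once more: PCRE receives [^\\"], \\\\ and \\"
$strict = '/^"([^\\\\"]|\\\\\\\\|\\\\")*"$/';

echo $loose, "\n";   // /^"([^\"]|\\|\")*"$/
echo $strict, "\n";  // /^"([^\\"]|\\\\|\\")*"$/

var_dump(preg_match($loose,  $subject));               // int(1)
var_dump(preg_match($strict, $subject));               // int(0)
var_dump(preg_match($strict, '"test \\", thing"'));    // int(1)
```

Echoing the pattern before passing it to preg_match shows exactly what PCRE receives, which is usually the quickest way to spot a missing level of escaping.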
 [2005-06-15 18:58 UTC] rasmus@php.net
It would be a hell of a lot easier to read your regexes if you would use single quotes.  eg.

$r = '/^"([^\\"]|\\\\|\\")*"$/';
$s = '"some text","test \\"';
preg_match($r, $s, $m);
var_dump($m);

for your above example.  And this stuff is documented.
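
As a quick check (a sketch only; the variable names are illustrative), the single-quoted form above and the earlier double-quoted form produce the same pattern string, so switching quote styles changes readability but not what PCRE receives:

```php
<?php
// Double-quoted form from the earlier comment.
$double = "/^\"([^\\\"]|\\\\|\\\")*\"\$/";
// Single-quoted form suggested above.
$single = '/^"([^\\"]|\\\\|\\")*"$/';

var_dump($double === $single);  // bool(true)
echo $single, "\n";             // /^"([^\"]|\\|\")*"$/
```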


 [2005-06-16 13:02 UTC] kloske at tpg dot com dot au
Look I don't really care anymore one way or another because I've figured out now how it all works on a level that's detailed enough for me to understand correctly enough to write useful stable and correct code, but just for interest's sake, my regex used quotes because:

1. I needed other escapes and variables in there which single quotes will not allow, and the alternative was using lots of dot notation which looked uglier than using double quotes.

2. The documentation of which you speak, where this is apparently documented, http://au.php.net/preg_match, examples 1-3 (the only numbered examples on this documentation page) all use double quotes. As an aside, all three examples are wrong, or at best highly misleading, since they use \b which inside a double quote escapes it before it ever gets to the PCRE code. I ran some tests today, and inside a double quote, it's much more correct to use \\b instead of \b. Whilst it will work since PCRE is smarter than us, when it comes to \\ it won't, because PCRE is also more careful than us and assumes when it sees the resulting \ that we're trying to escape something.

3. I really really wanted to. Single or double quotes, regex is regex. I am sorry if I violated your preference. I should point out that regex is now 860+ characters long, so it ain't going to be easy to read in single or double quotes. I merely compressed it down and stuck with the format I was using.

In spite of all this, I couldn't find anywhere in the PHP docs that they specifically mention the stuff about backslash escaping, and as I mentioned in point 2 above, far from it: they in fact mislead in their examples.
 [2005-06-16 13:07 UTC] kloske at tpg dot com dot au
Okay, found a page on the website which wasn't in my local docs:

http://php.planetmirror.com/manual/en/reference.pcre.pattern.syntax.php

It does mention the double quote thing. I stand corrected.

The other docs probably need a cleanup or something to fix the stuff I mentioned before.
 