[2005-06-14 09:14 UTC] kloske at tpg dot com dot au
Description:
------------
Whilst trying to get a > 600 character regular expression to correctly match input lines from a file I discovered some strange mismatching which at first I imagined was a bug in my regex string until I reduced it to the simple test case included below.
The test case shows a regex which should match lines that contain two fields, separated by a comma. Each field is identical and can either be a string that does not start with a quote or a comma and contains no commas, OR a string that starts and ends with a quote and in which any quotes or backslashes are escaped with a preceding backslash. I.e.: two fields, each of which may be either a simple string or a C-style escaped string, separated by a comma.
Note in my expected output I am making an educated guess as to what the actual output would be, some of the other fields printed might be a little different. The basics of the problem however are clearly demonstrated.
The final thing to note is that if I exclude quotes from the middle or end of the unquoted string case the problem vanishes. This leads me to suspect the problem is somehow related to regex's handling of quotes.
Even if there are problems with my regex (I am well aware it is not optimal or particularly "good" in any sense; be aware this is a cut-down test case only), this example clearly demonstrates PHP's regex engine matching a string containing characters that are clearly excluded by the pattern it matches.
I've tested this with one field and it doesn't appear to be a problem there - it seems to only affect two fields one after another.
Reproduce code:
---------------
<?php
$s = '"some text","test \",thing"';
$r_text = "(\"(([^\\\"]|\\\\|\\\")*)\"|[^\",][^,]*)";
$r_twofields = "${r_text},${r_text}";
preg_match("/^${r_twofields}\$/", $s, $line);
echo "<pre>";
echo $s . "\n";
echo $r_twofields . "\n";
var_dump($line);
echo "</pre>";
?>
Expected result:
----------------
"some text","test \",thing"
("(([^\"]|\\|\")*)"|[^",][^,]*),("(([^\"]|\\|\")*)"|[^",][^,]*)
array(5) {
[0]=>
string(27) ""some text","test \",thing""
[1]=>
string(20) ""some text","test \""
[2]=>
string(18) "some text","test \"
[3]=>
string(1) "\"
[4]=>
string(6) "thing""
}
Actual result:
--------------
"some text","test \", thing"
("(([^\"]|\\|\")*)"|[^",][^,]*),("(([^\"]|\\|\")*)"|[^",][^,]*)
array(5) {
[0]=>
string(28) ""some text","test \", thing""
[1]=>
string(20) ""some text","test \""
[2]=>
string(18) "some text","test \"
[3]=>
string(1) "\"
[4]=>
string(7) " thing""
}
Note that due to issues with the CAPTCHA, I've somehow included the wrong expected output and actual output. The ACTUAL output is:
"some text","test \",thing"
("(([^\"]|\\|\")*)"|[^",][^,]*),("(([^\"]|\\|\")*)"|[^",][^,]*)
array(5) {
[0]=>
string(27) ""some text","test \",thing""
[1]=>
string(20) ""some text","test \""
[2]=>
string(18) "some text","test \"
[3]=>
string(1) "\"
[4]=>
string(6) "thing""
}
And the expected output is:
"some text","test \",thing"
("(([^\"]|\\|\")*)"|[^",][^,]*),("(([^\"]|\\|\")*)"|[^",][^,]*)
array(5) {
[0]=>
string(27) ""some text","test \",thing""
[1]=>
string(20) ""some text""
[2]=>
string(18) "some text"
[3]=>
string(1) "t"
[4]=>
string(6) ""test \", thing""
[5]=>
string(6) "test \", thing"
[6]=>
string(1) "g"
}
Sorry for the confusion.

I have no idea what manuals you are reading or which peers you are talking to, but in perl-style regular expressions the '?' character is overloaded and has different meanings in different contexts. Type "man perlre" at your Unix prompt and you will see:
By default, a quantified subpattern is "greedy", that is, it will match as many times as possible (given a particular starting location) while still allowing the rest of the pattern to match. If you want it to match the minimum number of times possible, follow the quantifier with a "?".
If you still don't understand this, take it up with the developers of the PCRE library over at http://pcre.org since that is the code PHP uses. Even if somebody here agreed that there is a bug, it would have to be fixed by the PCRE folks.

Okay, the PCRE people have gotten back to me, and PCRE has proven to produce the correct expected behavior and my test case has not failed. So now we're left with a test case which fails in PHP yet works on PCRE.
For a more stark example, consider the following PHP code:
$r = "/^\"([^\\\"]|\\\\|\\\")*\"\$/";
$s = "\"some text\",\"test \\\"";
preg_match($r, $s, $m);
var_dump($m);
$m should be empty, since $s does not match $r, yet the following is returned:
array(2) {
[0]=>
string(20) ""some text","test \""
[1]=>
string(1) "\"
}
Note that the last element of the array contains a single backslash, indicating that the last choice that matched was a backslash, which is NOT ONE OF THE THREE CHOICES.
So, the PCRE people explained that they were not familiar with PHP but wondered if it is an escaping issue. Does PHP require you to DOUBLE escape regex? I.e., to match a sequence of two backslashes in a row, do you need to write "\\\\\\\\"?
I've tried doing this and it seems to give the expected behavior, yet the manual does not mention this fact, and worse, the user comments seem to indicate that you should not double escape (since no one is trying to match two backslashes in a row anywhere).
I'd say this is a documentation deficiency more than anything, since it should be made clear that you need to escape the string first for PHP, and then escape it again for correct interpretation by PCRE if you are trying to include a literal backslash, or in other situations where PCRE needs things escaped.
To recap, this is what you apparently need to write in PHP to match a literal of two backslashes next to each other:
"\\\\\\\\"
Gotta love it! Because: the number of backslashes is halved when PHP parses it as a string literal; PHP then passes the result literally to PCRE, which halves the number of backslashes again, leaving the final figure of two backslashes. Simple when you understand, not even hinted at in the PHP documentation.
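The two layers of escaping described above can be checked with a short sketch. It uses only `preg_match`. The variable names (`$asWritten`, `$doubled`, and so on) and the single-quoted rewrite of the pattern are illustrative assumptions, not part of the original report. It also suggests why the lone backslash was accepted: once PHP has processed the literal, PCRE sees `[^\"]`, which excludes only the quote, so a backslash still satisfies the first alternative.

```php
<?php
// Layer 1: PHP turns the double-quoted literal into the string PCRE receives.
$asWritten = "/^\"([^\\\"]|\\\\|\\\")*\"\$/";
echo $asWritten, "\n";   // /^"([^\"]|\\|\")*"$/

// Layer 2: PCRE reads [^\"] as "any character except a quote", so a lone
// backslash in the subject is accepted by the first alternative.
$s = "\"some text\",\"test \\\"";                      // "some text","test \"
$matchedAsWritten = preg_match($asWritten, $s, $m);    // 1, and $m[1] is a single backslash

// Assumed fix: double every backslash that is meant for PCRE. A single-quoted
// PHP string is used so that only \\ needs escaping at the PHP layer.
$doubled = '/^"([^\\\\"]|\\\\\\\\|\\\\")*"$/';
echo $doubled, "\n";     // /^"([^\\"]|\\\\|\\")*"$/

$matchedDoubled = preg_match($doubled, $s);                  // 0: the unescaped inner quote is rejected
$matchedField   = preg_match($doubled, '"test \\",thing"');  // 1: subject is "test \",thing"

// Matching two literal backslashes: PCRE needs \\\\, so the double-quoted
// PHP literal needs eight. The subject 'a\\\\b' is the string a\\b.
$twoBackslashes = preg_match("/\\\\\\\\/", 'a\\\\b');        // 1
$oneBackslash   = preg_match("/\\\\\\\\/", 'a\\b');          // 0: subject has only one backslash
```

Echoing the pattern before passing it to `preg_match`, as the first `echo` does, is a quick way to see exactly what PCRE will be asked to interpret.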