Bug #45372 hash# check in new re2c parser breaks code
Submitted: 2008-06-27 06:00 UTC Modified: 2008-07-08 15:23 UTC
From: alan_k@php.net Assigned: nlopess (profile)
Status: Closed Package: Scripting Engine problem
PHP Version: 5.3CVS-2008-06-27 (CVS) OS: linux
Private report: No CVE-ID: None
 [2008-06-27 06:00 UTC] alan_k@php.net
Description:
------------
single line file:

<?php if (1) { ?>#<?php }  ?>

produces a parse error.

It gets caught by this rule in the re2c scanner:
<INITIAL>"#".+ {NEWLINE} {
	if ((YYCTYPE*)yytext == SCNG(yy_start)) {
		/* ignore first line when it's started with a # */
		goto restart;
	} else {
		goto inline_char_handler;
	}
}


Basically, the scanner runs off the end and eats everything after the #.

I've fixed it by changing the above to something like:
} else {
        /* shunt back to just return the # on its own */
        YYCURSOR = YYMARKER;
        yyleng = 1;
        goto inline_char_handler;
}
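
To make the intended behaviour concrete, here is a minimal standalone sketch of the decision the rule has to make. It is not the scanner code: classify_hash, hash_action and the two enum values are hypothetical names invented for illustration, and the function assumes buf[pos] is the '#' in question. A '#' on the very first byte swallows its whole line; a '#' anywhere else is handed back as a single inline character.

```c
#include <stddef.h>

/* What to do with a '#' that starts a match (hypothetical names). */
typedef enum { HASH_SKIP_LINE, HASH_INLINE_CHAR } hash_action;

/*
 * buf/len: the whole input; pos: offset of the '#' (assumed pos < len).
 * Stores the number of bytes the token should cover in *consumed.
 */
static hash_action classify_hash(const char *buf, size_t len, size_t pos,
                                 size_t *consumed)
{
    if (pos == 0) {
        /* '#' on the very first byte: consume the line together with
         * its '\n', if there is one. */
        size_t end = pos;
        while (end < len && buf[end] != '\n') {
            end++;
        }
        if (end < len) {
            end++;            /* include the '\n' itself */
        }
        *consumed = end - pos;
        return HASH_SKIP_LINE;
    }
    /* Anywhere else '#' is ordinary inline text: give back whatever a
     * greedy match grabbed and keep only the single '#' byte. */
    *consumed = 1;
    return HASH_INLINE_CHAR;
}
```

With the single line file from this report, the '#' is not at offset 0, so only one byte should be consumed and scanning should resume at the following <?php.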








 [2008-06-27 09:05 UTC] mattwil@php.net
(yyless(1) could just be used before the goto...)

Anyway, did you actually try that? AFAIK it still won't work, at least with your single line example (which there's already been at least one report about). While the local code fix is correct, the re2c code/logic seems flawed to me. (Maybe this bug report can be about that instead, in general, since I didn't get around to sending a follow-up message to the internals@ list yet, explaining things. :-))

In this example, it will still be broken because of the YYFILL() check -- each time it checks if the next character can match, even when it's at the end of the input. YYFILL() then makes it return, completely ignoring anything that has matched up to that point!

I'm not sure if this explanation is 100% correct, but I believe this wrong behavior happens when EOF is encountered while trying to match the variable length part of ANY rule; or something close to that. :-) It's been over a month since I tried to track and figure out what was happening. Granted, most of the cases (unlike yours), where the match is aborted because of YYFILL(), it's with invalid code, but it shouldn't happen. BTW, I think the part with the inline_char_handler label where it looks for opening PHP tags in the HTML, while a good optimization (using memchr() to find < etc.), was actually added as a workaround for this re2c/YYFILL() behavior. I didn't try it, but from what I've observed, I think whatever plain HTML was at the end of a file would have been lost if a regular rule (like in Flex) was used to match it...
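
For readers who haven't looked at that part of the scanner, the memchr() idea can be sketched roughly like this. It is only an illustration under simplifying assumptions: find_open_tag is a hypothetical name, and it looks for the literal "<?php" only, whereas the real scanner also has to deal with the other opening tag forms.

```c
#include <stddef.h>
#include <string.h>

/*
 * Return the offset of the first "<?php" in buf[0..len), or len if there
 * is none. Everything before the returned offset is inline HTML.
 */
static size_t find_open_tag(const char *buf, size_t len)
{
    static const char tag[] = "<?php";
    const size_t tag_len = sizeof(tag) - 1;
    size_t pos = 0;

    while (pos < len) {
        /* memchr() jumps straight to the next '<' candidate */
        const char *lt = memchr(buf + pos, '<', len - pos);
        if (lt == NULL) {
            return len;
        }
        pos = (size_t)(lt - buf);
        if (len - pos >= tag_len && memcmp(lt, tag, tag_len) == 0) {
            return pos;
        }
        pos++;   /* a lone '<' (e.g. an HTML tag): keep looking */
    }
    return len;
}
```

Because the loop works on the buffer bounds directly, trailing HTML with no opening tag after it is still reported (the function returns len) instead of depending on a rule match that might be abandoned at EOF.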

Oh, there are also some more bugs in the code that looks for opening PHP tag, but they wouldn't be found as easily as this (and haven't been reported so far). I think I know how it can be fixed nicely, along with some more other scanner optimizations (for inline HTML and comments, basically). But I haven't done anything yet since some of it won't even work with these re2c/YYFILL() issues. :-/

Finally, to simplify what I think is the basic, underlying flaw with the code of re2c and YYFILL() now, here's a super easy example. Say you have one rule:

[a-z]+

It will NEVER match any input that a person would expect it to, such as the string "foo". Seems pretty messed up to me!?
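
A hand-written model of that behaviour, for illustration only. This is not re2c output: match_lower and its bail_on_eof flag are made up here, and the flag merely imitates my understanding of what happens when YYFILL() makes the scanner return at end of input.

```c
#include <stddef.h>

/*
 * Match [a-z]+ at the start of buf[0..len). Returns the match length, or
 * 0 for "no token". When bail_on_eof is nonzero, running out of input
 * while the rule could still continue abandons the scan, which models the
 * YYFILL() "return 0" behaviour described above.
 */
static size_t match_lower(const char *buf, size_t len, int bail_on_eof)
{
    size_t cur = 0;

    for (;;) {
        if (cur == len) {
            /* The rule wants one more byte but there is none. */
            if (bail_on_eof) {
                return 0;      /* everything matched so far is lost */
            }
            return cur;        /* accept the longest match seen */
        }
        if (buf[cur] < 'a' || buf[cur] > 'z') {
            return cur;        /* a non-matching byte ends the token */
        }
        cur++;
    }
}
```

In this model, "foo" at the very end of the input yields no token when bail_on_eof is set, while "foo " (with a trailing space) is matched fine, because the space ends the token before the end of input is ever reached.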
 [2008-06-27 09:31 UTC] johannes@php.net
This should work like in older releases, Marcus please check it!
 [2008-06-27 11:26 UTC] felipe@php.net
Duplicated... Bug #45147
 [2008-06-27 14:57 UTC] alan_k@php.net
Not sure why re2c needs to deal with the #! (shebang) situation.
Looking at the code, it would be better to eat that line outside of the lexer.


Something like:

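int ini_lex(zval *ini_lval TSRMLS_DC)
{
     /* skip a leading "#..." line before handing off to the scanner */
     if ((YYCTYPE*)yytext == SCNG(yy_start) && *yych == '#') {
         /* advance to the end of the line */
         while (yych < yyend && *yych != '\r' && *yych != '\n') {
            yych++;
         }
         /* then step over the line terminator(s) */
         while (yych < yyend && (*yych == '\r' || *yych == '\n')) {
            yych++;
         }
         YYCURSOR = yych;
     }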
.....
 [2008-07-06 17:01 UTC] nlopess@php.net
This bug has been fixed in CVS.

Snapshots of the sources are packaged every three hours; this change
will be in the next snapshot. You can grab the snapshot at
http://snaps.php.net/.
 
Thank you for the report, and for helping us make PHP better.


 [2008-07-07 14:12 UTC] mattwil@php.net
This is not fixed, actually (is it OK to change Status back to Open?).

Nuno, I saw your commit yesterday, which didn't seem like it would help (as it wasn't related to what I said above), but wanted to wait until I could check again to make sure I wasn't crazy with my above description. :-)

I just tried the latest Windows snapshot and it's still generating a parse error with the *single line file* (no newline at the end, which .+ won't match, therefore won't trigger the YYFILL() "return 0" thing) described in this report (and the CLI example in Bug #44654). Can't be only broken on Windows since everything uses the same generated scanner code...

Something like Alan's scanning loop could be done after just matching #, BUT that's just another workaround for that underlying re2c/YYFILL() problem (also affecting other things). I believe if you use the tokenizer extension, you can see that if the last token of code is matched by a variable length rule, it won't be returned. e.g. my example of a simple rule, [a-z]+ not matching input "foo"
 [2008-07-08 13:28 UTC] jani@php.net
Nuno, fix not correct?
 [2008-07-08 15:23 UTC] nlopess@php.net
Ok, I think it is really fixed now. I even fixed another related bug.
Please test and let me know if you can still break it :-)
 
PHP Copyright © 2001-2024 The PHP Group
All rights reserved.
Last updated: Thu Nov 21 11:01:29 2024 UTC