php.net |  support |  documentation |  report a bug |  advanced search |  search howto |  statistics |  random bug |  login
Bug #54089 token_get_all with regards to __halt_compiler is not binary safe
Submitted: 2011-02-24 13:16 UTC Modified: 2011-11-08 09:52 UTC
From: nicolas dot grekas+php at gmail dot com Assigned: iliaa (profile)
Status: Closed Package: Unknown/Other Function
PHP Version: 5.3.5 OS: Any
Private report: No CVE-ID: None
 [2011-02-24 13:16 UTC] nicolas dot grekas+php at gmail dot com
Description:
------------
A. token_get_all() eats some characters which are not allowed in plain PHP code and trigger a "Unexpected character in input" warning.

B. after a T_HALT_COMPILER, the tokens are still identified as if they were not after this T_HALT_COMPILER, when in reality they are just random data.

So, when using token_get_all on code which contains a T_HALT_COMPILER, the data after that is corrupted because of A.

Test script:
---------------
<?php

$code = "<?php __halt_compiler();\x01?>\x02";

$tokens = token_get_all($code);
$reconstructed_code = '';

foreach ($tokens as $t)
{
	$reconstructed_code .= isset($t[1]) ? $t[1] : $t;
}

var_dump($code);
var_dump($reconstructed_code);


Expected result:
----------------
string(28) "<?php __halt_compiler();?>"
string(28) "<?php __halt_compiler();?>"


Actual result:
--------------
PHP Warning:  Unexpected character in input:  '' (ASCII=1) state=0 on line 5
string(28) "<?php __halt_compiler();?>"
string(27) "<?php __halt_compiler();?>"


Patches

tokenizer_patch_full.txt (last revision 2011-09-15 14:32 UTC by nikic@php.net)
tokenizer_patch.txt (last revision 2011-09-15 14:27 UTC by nikic@php.net)

Pull Requests

History

AllCommentsChangesGit/SVN commitsRelated reports
 [2011-02-28 16:18 UTC] iliaa@php.net
Automatic comment from SVN on behalf of iliaa
Revision: http://svn.php.net/viewvc/?view=revision&amp;revision=308761
Log: Fixed bug #54089 (token_get_all() does not stop after __halt_compiler).
 [2011-02-28 16:18 UTC] iliaa@php.net
-Status: Open +Status: Closed -Assigned To: +Assigned To: iliaa
 [2011-02-28 16:18 UTC] iliaa@php.net
This bug has been fixed in SVN.

Snapshots of the sources are packaged every three hours; this change
will be in the next snapshot. You can grab the snapshot at
http://snaps.php.net/.
 
Thank you for the report, and for helping us make PHP better.


 [2011-03-01 10:15 UTC] nicolas dot grekas+php at gmail dot com
Thanks for the patch. After reading it, I'm not sure it really helps, considering that the stop on T_HALT_COMPILER was already easily feasible in plain PHP. In fact, it may be worse, because now if I want to access data after T_HALT_COMPILER in PHP using tokenizer, I have to write more code, as the data is missing from the token array.

As a corner case also, __halt_compiler is always followed by 3 valid tokens: "(", ")" then ";" or T_CLOSE_TAG, with any number of T_WHITESPACE/T_COMMENT/T_DOC_COMMENT between.

My view is that this "bug" can be fixed by introducing a new T_UNEXPECTED_CHARACTER token type, matching those "Unexpected character in input" warnings: this would fix token_get_all binary unsafeness. Is it a good idea? I don't know if it's difficult to implement, nor if it would introduce any BC break, so maybe a "Won't fix" on this bug is enough?

Could the patch be reverted? I'm afraid it's the best for tokenizer users...
Here is what I was using before the patch to work around this binary incompatibility:

<?php

// New token matching an "Unexpected character in input"
define('T_UNEXPECTED_CHARACTER', -1); 

$src_tokens = @token_get_all($code);
$bin_tokens = array();
$offset =  0;
$i      = -1;

while (isset($src_tokens[++$i]))
{
	$t = isset($src_tokens[$i][1]) ? $src_tokens[$i][1] : $src_tokens[$i];

	while ($t[0] !== $code[$offset])
		$bin_tokens[] = array(T_UNEXPECTED_CHARACTER, $code[$offset++]);

	$offset += strlen($t);
	$bin_tokens[] = $src_tokens[$i];
	unset($src_tokens[$i]);
}

// Here, $bin_tokens contains binary safe tokens

?>
 [2011-03-03 15:44 UTC] nicolas dot grekas+php at gmail dot com
-Status: Closed +Status: Assigned
 [2011-03-03 15:44 UTC] nicolas dot grekas+php at gmail dot com
Sorry to reopen. As 5.3.6 is in RC, I just want to be sure my previous comment has been read. What about reverting the patch ?
 [2011-03-10 08:26 UTC] nicolas dot grekas+php at gmail dot com
-Status: Assigned +Status: Open
 [2011-03-10 08:26 UTC] nicolas dot grekas+php at gmail dot com
Really, the actual patch is a step backward, I can't do things that were easy before (getting the halt_compiler_offset with token_get_all)...
Please consider reverting it!
 [2011-03-17 23:51 UTC] nicolas dot grekas+php at gmail dot com
-Status: Open +Status: Closed
 [2011-03-17 23:51 UTC] nicolas dot grekas+php at gmail dot com
Latest 5.3.6 has been released with the patch...
So now token_get_all() stops on T_HALT_COMPILER for ever :)
 [2011-09-13 07:49 UTC] nicolas dot grekas+php at gmail dot com
-Status: Closed +Status: Assigned
 [2011-09-13 07:49 UTC] nicolas dot grekas+php at gmail dot com
Excerpt from internals:

Nikita Popov 	Fri, Sep 9, 2011 at 09:15

[...]
The problem with the patch is, that there are some tokens after
T_HALT_COMPILER that are of interest, namely the '(' ')' ';'. After
the patch it is impossible to get those tokens, without either
relexing the code after T_HALT_COMPILER (that way you get the binary
data problem back, just with much more complex code) or writing a
regular expression to match it (which is really hard, as there may be
any token dropped by the PHP parser in there, i.e. whitespace,
comments, PHP tags).

[...]
I would like this patch to be reverted on the 5.4 and trunk branches.
I assume it's too late to revert it on 5.3, as it has been there for
some time already. It is just counterproductive. (Alternatively one
could fix token_get_all to return the (); tokens after
__halt_compiler, too, but that would be hard, probably.)

--

Ferenc Kovacs 	Fri, Sep 9, 2011 at 10:01

I think that it wouldn't be too hard.
From a quick glance on the code, we should introduce a new local
variable, set that to true where we break now (
http://svn.php.net/viewvc/php/php-src/trunk/ext/tokenizer/tokenizer.c?view=markup#l155
) and don't break there but for the next ';'. another maybe less
confusing solution would be to explicitly add '(', ')' and ';' to the
result in the T_HALT_COMPILER condition before breking out of the
loop.
I will create a patch for this afternoon.

or could there be other important tokens after the __halt_compiler()
which should be present in the token_get_all() result?

--

Nicolas Grekas 	Fri, Sep 9, 2011 at 10:46

> don't break there but for the next ';'.

You can also just count the number of semantic token after T_HALT_COMPILER (ie excluding whitespace and comments) and once you hit 3, halt.

> less confusing solution would be to explicitly add '(', ')' and ';' to the
> result in the T_HALT_COMPILER condition before breking out of the loop.

If you mean verifying that '(', ')' and (';' or T_CLOSE_TAG) are effectively following T_HALT_COMPILER, I think that's part of the syntax analyser's job, not tokenizer's.
If you're ok with this argument, then just couting 3 tokens is really the most basic "syntax analysis" we have to do to fix the pb, don't you think?

> could there be other important tokens after the __halt_compiler()
> which should be present in the token_get_all() result?

Maybe the binary data itself, as a big T_INLINE_HTML for example ?

Also, if token_get_all you be made binary safe, that would be very cool ! (no more eating of \x00-\x1F inside regular code) :)

--

Nikita Popov  	Fri, Sep 9, 2011 at 19:39

> You can also just count the number of semantic token after T_HALT_COMPILER
> (ie excluding whitespace and comments) and once you hit 3, halt.
> [...]
> Maybe the binary data itself, as a big T_INLINE_HTML for example ?
In favor of both proposals!
Returning the next 3 tokens should be quite easy [1].
Returning the rest as an T_INLINE_HTML makes sense too, as extracting
the data is probably what you want. Though I have no idea how to
implement that ^^

[1]: https://github.com/nikic/php-src/commit/2d4cfa05947f04de447635ca5748b3b58defbfaf
(Not tested, only guessing)

--

Nikita Popov  	Tue, Sep 13, 2011 at 09:16

I just set up an PHP environment and wrote a proper patch (including
test changes) to make it collect the next three tokens. It's a git
patch and I'm not sure whether it's compatible with SVN patches. I
would love it if this would go into 5.4 before beta. I didn't know how
one could fetch the rest into T_INLINE_HTML, so I'm hoping on help
here from someone who actually knowns C there :)
 [2011-09-13 15:50 UTC] nikic@php.net
The following patch has been added/updated:

Patch Name: tokenizer_patch.txt
Revision:   1315929049
URL:        https://bugs.php.net/patch-display.php?bug=54089&patch=tokenizer_patch.txt&revision=1315929049
 [2011-09-13 17:11 UTC] nikic@php.net
The following patch has been added/updated:

Patch Name: tokenizer_patch_full.txt
Revision:   1315933914
URL:        https://bugs.php.net/patch-display.php?bug=54089&patch=tokenizer_patch_full.txt&revision=1315933914
 [2011-09-15 14:27 UTC] nikic@php.net
The following patch has been added/updated:

Patch Name: tokenizer_patch.txt
Revision:   1316096847
URL:        https://bugs.php.net/patch-display.php?bug=54089&patch=tokenizer_patch.txt&revision=1316096847
 [2011-09-15 14:32 UTC] nikic@php.net
The following patch has been added/updated:

Patch Name: tokenizer_patch_full.txt
Revision:   1316097166
URL:        https://bugs.php.net/patch-display.php?bug=54089&patch=tokenizer_patch_full.txt&revision=1316097166
 [2011-11-08 05:18 UTC] stas@php.net
This bug has been fixed in SVN.

Snapshots of the sources are packaged every three hours; this change
will be in the next snapshot. You can grab the snapshot at
http://snaps.php.net/.

 For Windows:

http://windows.php.net/snapshots/
 
Thank you for the report, and for helping us make PHP better.


 [2011-11-08 05:18 UTC] stas@php.net
-Status: Assigned +Status: Closed
 [2011-11-08 09:52 UTC] tyrael@php.net
for the record, it was merged in http://svn.php.net/viewvc?
view=revision&revision=318902 so it is only fixed for 5.4 and trunk.
as the development of 5.3.9 only started, there is a chance that this will be 
merged there also.
 
PHP Copyright © 2001-2024 The PHP Group
All rights reserved.
Last updated: Thu Dec 05 19:01:30 2024 UTC