php.net |  support |  documentation |  report a bug |  advanced search |  search howto |  statistics |  random bug |  login
Bug #54089 token_get_all with regards to __halt_compiler is not binary safe
Submitted: 2011-02-24 13:16 UTC Modified: 2011-11-08 09:52 UTC
From: nicolas dot grekas+php at gmail dot com Assigned: iliaa (profile)
Status: Closed Package: Unknown/Other Function
PHP Version: 5.3.5 OS: Any
Private report: No CVE-ID: None
Welcome back! If you're the original bug submitter, here's where you can edit the bug or add additional notes.
If you forgot your password, you can retrieve your password here.
Password:
Status:
Package:
Bug Type:
Summary:
From: nicolas dot grekas+php at gmail dot com
New email:
PHP Version: OS:

 

 [2011-02-24 13:16 UTC] nicolas dot grekas+php at gmail dot com
Description:
------------
A. token_get_all() eats some characters which are not allowed in plain PHP code and trigger a "Unexpected character in input" warning.

B. after a T_HALT_COMPILER, the tokens are still identified as if they were not after this T_HALT_COMPILER, when in reality they are just random data.

So, when using token_get_all on code which contains a T_HALT_COMPILER, the data after that is corrupted because of A.

Test script:
---------------
<?php

$code = "<?php __halt_compiler();\x01?>\x02";

$tokens = token_get_all($code);
$reconstructed_code = '';

foreach ($tokens as $t)
{
	$reconstructed_code .= isset($t[1]) ? $t[1] : $t;
}

var_dump($code);
var_dump($reconstructed_code);


Expected result:
----------------
string(28) "<?php __halt_compiler();?>"
string(28) "<?php __halt_compiler();?>"


Actual result:
--------------
PHP Warning:  Unexpected character in input:  '' (ASCII=1) state=0 on line 5
string(28) "<?php __halt_compiler();?>"
string(27) "<?php __halt_compiler();?>"


Patches

tokenizer_patch_full.txt (last revision 2011-09-15 14:32 UTC by nikic@php.net)
tokenizer_patch.txt (last revision 2011-09-15 14:27 UTC by nikic@php.net)

Pull Requests

History

AllCommentsChangesGit/SVN commitsRelated reports
 [2011-02-28 16:18 UTC] iliaa@php.net
Automatic comment from SVN on behalf of iliaa
Revision: http://svn.php.net/viewvc/?view=revision&amp;revision=308761
Log: Fixed bug #54089 (token_get_all() does not stop after __halt_compiler).
 [2011-02-28 16:18 UTC] iliaa@php.net
-Status: Open +Status: Closed -Assigned To: +Assigned To: iliaa
 [2011-02-28 16:18 UTC] iliaa@php.net
This bug has been fixed in SVN.

Snapshots of the sources are packaged every three hours; this change
will be in the next snapshot. You can grab the snapshot at
http://snaps.php.net/.
 
Thank you for the report, and for helping us make PHP better.


 [2011-03-01 10:15 UTC] nicolas dot grekas+php at gmail dot com
Thanks for the patch. After reading it, I'm not sure it really helps, considering that the stop on T_HALT_COMPILER was already easily feasible in plain PHP. In fact, it may be worse, because now if I want to access data after T_HALT_COMPILER in PHP using tokenizer, I have to write more code, as the data is missing from the token array.

As a corner case also, __halt_compiler is always followed by 3 valid tokens: "(", ")" then ";" or T_CLOSE_TAG, with any number of T_WHITESPACE/T_COMMENT/T_DOC_COMMENT between.

My view is that this "bug" can be fixed by introducing a new T_UNEXPECTED_CHARACTER token type, matching those "Unexpected character in input" warnings: this would fix token_get_all binary unsafeness. Is it a good idea? I don't know if it's difficult to implement, nor if it would introduce any BC break, so maybe a "Won't fix" on this bug is enough?

Could the patch be reverted? I'm afraid it's the best for tokenizer users...
Here is what I was using before the patch to work around this binary incompatibility:

<?php

// New token matching an "Unexpected character in input"
define('T_UNEXPECTED_CHARACTER', -1); 

$src_tokens = @token_get_all($code);
$bin_tokens = array();
$offset =  0;
$i      = -1;

while (isset($src_tokens[++$i]))
{
	$t = isset($src_tokens[$i][1]) ? $src_tokens[$i][1] : $src_tokens[$i];

	while ($t[0] !== $code[$offset])
		$bin_tokens[] = array(T_UNEXPECTED_CHARACTER, $code[$offset++]);

	$offset += strlen($t);
	$bin_tokens[] = $src_tokens[$i];
	unset($src_tokens[$i]);
}

// Here, $bin_tokens contains binary safe tokens

?>
 [2011-03-03 15:44 UTC] nicolas dot grekas+php at gmail dot com
-Status: Closed +Status: Assigned
 [2011-03-03 15:44 UTC] nicolas dot grekas+php at gmail dot com
Sorry to reopen. As 5.3.6 is in RC, I just want to be sure my previous comment has been read. What about reverting the patch ?
 [2011-03-10 08:26 UTC] nicolas dot grekas+php at gmail dot com
-Status: Assigned +Status: Open
 [2011-03-10 08:26 UTC] nicolas dot grekas+php at gmail dot com
Really, the actual patch is a step backward, I can't do things that were easy before (getting the halt_compiler_offset with token_get_all)...
Please consider reverting it!
 [2011-03-17 23:51 UTC] nicolas dot grekas+php at gmail dot com
-Status: Open +Status: Closed
 [2011-03-17 23:51 UTC] nicolas dot grekas+php at gmail dot com
Latest 5.3.6 has been released with the patch...
So now token_get_all() stops on T_HALT_COMPILER for ever :)
 [2011-09-13 07:49 UTC] nicolas dot grekas+php at gmail dot com
-Status: Closed +Status: Assigned
 [2011-09-13 07:49 UTC] nicolas dot grekas+php at gmail dot com
Excerpt from internals:

Nikita Popov 	Fri, Sep 9, 2011 at 09:15

[...]
The problem with the patch is, that there are some tokens after
T_HALT_COMPILER that are of interest, namely the '(' ')' ';'. After
the patch it is impossible to get those tokens, without either
relexing the code after T_HALT_COMPILER (that way you get the binary
data problem back, just with much more complex code) or writing a
regular expression to match it (which is really hard, as there may be
any token dropped by the PHP parser in there, i.e. whitespace,
comments, PHP tags).

[...]
I would like this patch to be reverted on the 5.4 and trunk branches.
I assume it's too late to revert it on 5.3, as it has been there for
some time already. It is just counterproductive. (Alternatively one
could fix token_get_all to return the (); tokens after
__halt_compiler, too, but that would be hard, probably.)

--

Ferenc Kovacs 	Fri, Sep 9, 2011 at 10:01

I think that it wouldn't be too hard.
From a quick glance on the code, we should introduce a new local
variable, set that to true where we break now (
http://svn.php.net/viewvc/php/php-src/trunk/ext/tokenizer/tokenizer.c?view=markup#l155
) and don't break there but for the next ';'. another maybe less
confusing solution would be to explicitly add '(', ')' and ';' to the
result in the T_HALT_COMPILER condition before breking out of the
loop.
I will create a patch for this afternoon.

or could there be other important tokens after the __halt_compiler()
which should be present in the token_get_all() result?

--

Nicolas Grekas 	Fri, Sep 9, 2011 at 10:46

> don't break there but for the next ';'.

You can also just count the number of semantic token after T_HALT_COMPILER (ie excluding whitespace and comments) and once you hit 3, halt.

> less confusing solution would be to explicitly add '(', ')' and ';' to the
> result in the T_HALT_COMPILER condition before breking out of the loop.

If you mean verifying that '(', ')' and (';' or T_CLOSE_TAG) are effectively following T_HALT_COMPILER, I think that's part of the syntax analyser's job, not tokenizer's.
If you're ok with this argument, then just couting 3 tokens is really the most basic "syntax analysis" we have to do to fix the pb, don't you think?

> could there be other important tokens after the __halt_compiler()
> which should be present in the token_get_all() result?

Maybe the binary data itself, as a big T_INLINE_HTML for example ?

Also, if token_get_all you be made binary safe, that would be very cool ! (no more eating of \x00-\x1F inside regular code) :)

--

Nikita Popov  	Fri, Sep 9, 2011 at 19:39

> You can also just count the number of semantic token after T_HALT_COMPILER
> (ie excluding whitespace and comments) and once you hit 3, halt.
> [...]
> Maybe the binary data itself, as a big T_INLINE_HTML for example ?
In favor of both proposals!
Returning the next 3 tokens should be quite easy [1].
Returning the rest as an T_INLINE_HTML makes sense too, as extracting
the data is probably what you want. Though I have no idea how to
implement that ^^

[1]: https://github.com/nikic/php-src/commit/2d4cfa05947f04de447635ca5748b3b58defbfaf
(Not tested, only guessing)

--

Nikita Popov  	Tue, Sep 13, 2011 at 09:16

I just set up an PHP environment and wrote a proper patch (including
test changes) to make it collect the next three tokens. It's a git
patch and I'm not sure whether it's compatible with SVN patches. I
would love it if this would go into 5.4 before beta. I didn't know how
one could fetch the rest into T_INLINE_HTML, so I'm hoping on help
here from someone who actually knowns C there :)
 [2011-09-13 15:50 UTC] nikic@php.net
The following patch has been added/updated:

Patch Name: tokenizer_patch.txt
Revision:   1315929049
URL:        https://bugs.php.net/patch-display.php?bug=54089&patch=tokenizer_patch.txt&revision=1315929049
 [2011-09-13 17:11 UTC] nikic@php.net
The following patch has been added/updated:

Patch Name: tokenizer_patch_full.txt
Revision:   1315933914
URL:        https://bugs.php.net/patch-display.php?bug=54089&patch=tokenizer_patch_full.txt&revision=1315933914
 [2011-09-15 14:27 UTC] nikic@php.net
The following patch has been added/updated:

Patch Name: tokenizer_patch.txt
Revision:   1316096847
URL:        https://bugs.php.net/patch-display.php?bug=54089&patch=tokenizer_patch.txt&revision=1316096847
 [2011-09-15 14:32 UTC] nikic@php.net
The following patch has been added/updated:

Patch Name: tokenizer_patch_full.txt
Revision:   1316097166
URL:        https://bugs.php.net/patch-display.php?bug=54089&patch=tokenizer_patch_full.txt&revision=1316097166
 [2011-11-08 05:18 UTC] stas@php.net
This bug has been fixed in SVN.

Snapshots of the sources are packaged every three hours; this change
will be in the next snapshot. You can grab the snapshot at
http://snaps.php.net/.

 For Windows:

http://windows.php.net/snapshots/
 
Thank you for the report, and for helping us make PHP better.


 [2011-11-08 05:18 UTC] stas@php.net
-Status: Assigned +Status: Closed
 [2011-11-08 09:52 UTC] tyrael@php.net
for the record, it was merged in http://svn.php.net/viewvc?
view=revision&revision=318902 so it is only fixed for 5.4 and trunk.
as the development of 5.3.9 only started, there is a chance that this will be 
merged there also.
 
PHP Copyright © 2001-2025 The PHP Group
All rights reserved.
Last updated: Mon Feb 03 18:01:32 2025 UTC