PHP :: Bug #33093 :: token_get_all() inconsistent results?

Bug #33093	token_get_all() inconsistent results?
Submitted:	2005-05-21 18:40 UTC	Modified:	2005-05-27 09:00 UTC
From:	pmjones@php.net	Assigned:
Status:	Not a bug	Package:	Unknown/Other Function
PHP Version:	5.0.4	OS:	Mac OS X 10.4.1
Private report:	No	CVE-ID:	None

[2005-05-21 18:40 UTC] pmjones@php.net

Description:
------------
token_get_all() appears to report T_OPEN_TAG and T_WHITESPACE inconsistently, depending on the whitespace that follows the opening tag.  For example, when parsing ...

<?php echo $var ?>

... you get T_OPEN_TAG, T_ECHO, T_WHITESPACE, T_VARIABLE, T_WHITESPACE, and T_CLOSE_TAG.  This is not entirely the expected result (I would expect a T_WHITESPACE token between the open tag and the echo).

However, when parsing the functional equivalent...

<?php

echo $var

?>

... you get "<", "?", T_STRING ("php"), T_WHITESPACE, T_ECHO, T_WHITESPACE, T_VARIABLE, T_WHITESPACE, and T_CLOSE_TAG.  In addition, the first whitespace value reported does not include all the newlines (it drops one).

Although Macs use \r for their newlines natively, the test code uses the Unix-standard \n, so I don't think it's Mac-related.

If this is in fact a bug, the current behavior makes it difficult to write a reliable userland code auditor and report proper line numbers.

Am I missing some assumptions behind the behavior of the tokenizer function?


[2005-05-22 05:51 UTC] alan_k@php.net

Where's the missing data?

php -r 'var_dump(token_get_all("<?php echo \$var ?>"));'
array(6) {
  [0]=>
  array(2) {
    [0]=>
    int(366)
    [1]=>
    string(6) "<?php "
  }
  [1]=>
  array(2) {
    [0]=>
    int(316)
    [1]=>
    string(4) "echo"
  }
  [2]=>
  array(2) {
    [0]=>
    int(369)
    [1]=>
    string(1) " "
  }
  [3]=>
  array(2) {
    [0]=>
    int(309)
    [1]=>
    string(4) "$var"
  }
  [4]=>
  array(2) {
    [0]=>
    int(369)
    [1]=>
    string(1) " "
  }
  [5]=>
  array(2) {
    [0]=>
    int(368)
    [1]=>
    string(2) "?>"
  }
}




php -r 'var_dump(token_get_all("<?php \necho \$var\n?>"));'
array(7) {
  [0]=>
  array(2) {
    [0]=>
    int(366)
    [1]=>
    string(6) "<?php "
  }
  [1]=>
  array(2) {
    [0]=>
    int(369)
    [1]=>
    string(1) "
"
  }
  [2]=>
  array(2) {
    [0]=>
    int(316)
    [1]=>
    string(4) "echo"
  }
  [3]=>
  array(2) {
    [0]=>
    int(369)
    [1]=>
    string(1) " "
  }
  [4]=>
  array(2) {
    [0]=>
    int(309)
    [1]=>
    string(4) "$var"
  }
  [5]=>
  array(2) {
    [0]=>
    int(369)
    [1]=>
    string(1) "
"
  }
  [6]=>
  array(2) {
    [0]=>
    int(368)
    [1]=>
    string(2) "?>"
  }
}

[2005-05-22 05:55 UTC] alan_k@php.net

Actually, the tokenizer just plugs into the internal tokenizing code used by the engine. The engine doesn't need some of this information, and that code is written to run as quickly and efficiently as possible rather than to be 100% dead-on for parsing.

It's unlikely to be fixed just for token_get_all(), as introducing changes can have quite radical effects sometimes when touching that bit of code.

The text stored with each token should still let you get the CR/LF count right.
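
As a rough illustration of that suggestion, the sketch below keeps a running line counter by counting the newlines in each token's text. The helper name tokens_with_lines() is hypothetical and not part of PHP. It assumes the source uses \n or \r\n line endings (counting "\n" covers both) and would miscount files that use bare \r.

```php
<?php
// Hypothetical helper (not a PHP built-in): pair every token with the
// line on which it starts, derived only from the token text.
function tokens_with_lines($source)
{
    $line = 1;
    $out = array();
    foreach (token_get_all($source) as $token) {
        // Single-character tokens are plain strings; all others are arrays
        // whose element 0 is the token id and element 1 is the text.
        $text = is_array($token) ? $token[1] : $token;
        $name = is_array($token) ? token_name($token[0]) : $text;
        $out[] = array($name, $text, $line);
        // Advance the counter by the newlines this token contains,
        // including any newline absorbed into T_OPEN_TAG.
        $line += substr_count($text, "\n");
    }
    return $out;
}

foreach (tokens_with_lines("<?php\n\necho \$var\n\n?>") as $t) {
    printf("%-14s starts on line %d\n", $t[0], $t[2]);
}
```

Because the counter is driven by the token text alone, it does not matter whether a newline is reported inside T_OPEN_TAG or as part of a following T_WHITESPACE token.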

[2005-05-22 13:16 UTC] derick@php.net

Indeed, there is no bug here.
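
One way to check that no text goes missing is to glue the token text back together and compare it with the input. This is only a sketch: rebuild_source() is a hypothetical helper name, and the sample string is the multi-line snippet from the report with pairs of \n newlines.

```php
<?php
// Hypothetical helper (not a PHP built-in): concatenate the text of
// every token returned by token_get_all().
function rebuild_source($source)
{
    $out = '';
    foreach (token_get_all($source) as $token) {
        // Array tokens carry their text at index 1; string tokens are the text.
        $out .= is_array($token) ? $token[1] : $token;
    }
    return $out;
}

$source = "<?php\n\necho \$var\n\n?>";
// True when the token stream accounts for every character of the input.
var_dump(rebuild_source($source) === $source);
```

If the comparison holds, whitespace that seems to be missing from a T_WHITESPACE token is sitting in a neighbouring token instead, as with the "<?php " value in the dumps above.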

[2005-05-22 15:47 UTC] pmjones@php.net

The second command-line test should have pairs of \n newlines, not singles.

A corollary issue is that the results on the same code are inconsistent. Sometimes my token_get_all() returns the expected result (T_OPEN_TAG) and sometimes an unexpected result ("<", "?", T_STRING of "php").  Could there be a reason for the engine being "finicky"?

[2005-05-27 09:00 UTC] sniper@php.net

Still no bug here.



Copyright © 2001-2025 The PHP Group All rights reserved.	Last updated: Fri Nov 28 16:00:01 2025 UTC