Bug #48446 Tokenizer reports two T_INLINE_HTML at tags starting with s
Submitted: 2009-06-01 16:48 UTC
Modified: 2009-06-01 17:32 UTC
From: shawn at shawnbiddle dot com
Assigned:
Status: Closed
Package: Scripting Engine problem
PHP Version: 5.2.9
OS: Linux
Private report: No
CVE-ID: None
 [2009-06-01 16:48 UTC] shawn at shawnbiddle dot com
Description:
------------
When token_get_all() is run on a script that contains both PHP and HTML, it splits a T_INLINE_HTML token in two every time it runs across an HTML tag starting with "s". My example uses <span>, but any tag starting with "s" triggers the split.

Reproduce code:
---------------
<?php
   print_r(token_get_all('<?php echo "Hello World!"; ?><h6>Hello</h6><span class="test">Hello!</span><?php echo "Goodbye, World!"; ?>'));
 ?>
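
A shorter way to see the same thing is to print only the inline-HTML fragments. This is a minimal sketch, not part of the original report: inline_html_chunks() is a hypothetical helper name, and it compares against the T_INLINE_HTML constant instead of the numeric ID 311, since token numbers can differ between PHP versions.

```php
<?php
// Hypothetical helper (not from the original report): return only the text
// of the T_INLINE_HTML tokens found in $source.
function inline_html_chunks($source)
{
    $chunks = array();
    foreach (token_get_all($source) as $token) {
        // Single-character tokens such as ";" are plain strings, not arrays.
        if (is_array($token) && $token[0] === T_INLINE_HTML) {
            $chunks[] = $token[1];
        }
    }
    return $chunks;
}

$source = '<?php echo "Hello World!"; ?><h6>Hello</h6><span class="test">Hello!</span><?php echo "Goodbye, World!"; ?>';

// One element means the HTML stayed in one token; several elements mean
// the tokenizer split it.
print_r(inline_html_chunks($source));
```

However many fragments come back, joining them with implode('', ...) should reproduce the original HTML, which makes this a convenient check on any PHP version.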

Expected result:
----------------
Array
(
  [0] => Array
      (
          [0] => 367
          [1] => <?php
          [2] => 1
      )

  [1] => Array
      (
          [0] => 316
          [1] => echo
          [2] => 1
      )

  [2] => Array
      (
          [0] => 370
          [1] =>
          [2] => 1
      )

  [3] => Array
      (
          [0] => 315
          [1] => "Hello World!"
          [2] => 1
      )

  [4] => ;
  [5] => Array
      (
          [0] => 370
          [1] =>
          [2] => 1
      )

  [6] => Array
      (
          [0] => 369
          [1] => ?>
          [2] => 1
      )

  [7] => Array
      (
          [0] => 311
          [1] => <h6>Hello</h6><span class="test">Hello!</span>
          [2] => 1
      )

  [8] => Array
      (
          [0] => 367
          [1] => <?php
          [2] => 1
      )

  [9] => Array
      (
          [0] => 316
          [1] => echo
          [2] => 1
      )

  [10] => Array
      (
          [0] => 370
          [1] =>
          [2] => 1
      )

  [11] => Array
      (
          [0] => 315
          [1] => "Goodbye, World!"
          [2] => 1
      )

  [12] => ;
  [13] => Array
      (
          [0] => 370
          [1] =>
          [2] => 1
      )

  [14] => Array
      (
          [0] => 369
          [1] => ?>
          [2] => 1
      )
)


Actual result:
--------------
Array
(
    [0] => Array
        (
            [0] => 367
            [1] => <?php
            [2] => 1
        )

    [1] => Array
        (
            [0] => 316
            [1] => echo
            [2] => 1
        )

    [2] => Array
        (
            [0] => 370
            [1] =>
            [2] => 1
        )

    [3] => Array
        (
            [0] => 315
            [1] => "Hello World!"
            [2] => 1
        )

    [4] => ;
    [5] => Array
        (
            [0] => 370
            [1] =>
            [2] => 1
        )

    [6] => Array
        (
            [0] => 369
            [1] => ?>
            [2] => 1
        )

    [7] => Array
        (
            [0] => 311
            [1] => <h6>Hello</h6>
            [2] => 1
        )

    [8] => Array
        (
            [0] => 311
            [1] => <s
            [2] => 1
        )

    [9] => Array
        (
            [0] => 311
            [1] => pan class="test">Hello!</span>
            [2] => 1
        )

    [10] => Array
        (
            [0] => 367
            [1] => <?php
            [2] => 1
        )

    [11] => Array
        (
            [0] => 316
            [1] => echo
            [2] => 1
        )

    [12] => Array
        (
            [0] => 370
            [1] =>
            [2] => 1
        )

    [13] => Array
        (
            [0] => 315
            [1] => "Goodbye, World!"
            [2] => 1
        )

    [14] => ;
    [15] => Array
        (
            [0] => 370
            [1] =>
            [2] => 1
        )

    [16] => Array
        (
            [0] => 369
            [1] => ?>
            [2] => 1
        )

)


 [2009-06-01 17:16 UTC] mattwil@php.net
Yeah, that's just how the tokenizer/scanner has always worked. It stops at "<s" to prevent a long PHP opening tag (e.g. <script language="php">) from being taken as inline HTML. The regular expressions in the scanner can't "look ahead" to make sure that what follows is NOT a PHP opening tag, and doing extra checking in the code after scanning additional input would be more complicated, if it's even possible (it's been a while since I looked).

The good news, however, is that PHP 5.3 uses a new scanner, and as of a few weeks ago (5.3.0 RC2) it works as you'd expect: all contiguous HTML is kept as one token. :-)
 [2009-06-01 17:32 UTC] shawn at shawnbiddle dot com
Gotcha, guess I'll have to hackity-hack-hack around it until a stable 5.3 is released.
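
One possible way to work around this on 5.2 is to post-process the token list and glue adjacent T_INLINE_HTML tokens back together. The sketch below is only an illustration under that assumption; merge_inline_html() is a hypothetical name, not something from this report or from the tokenizer extension. The sample input simulates the split fragments shown in the actual result above, so the example behaves the same on any PHP version.

```php
<?php
// Hypothetical workaround helper (not from the original report): collapse
// runs of adjacent T_INLINE_HTML tokens into a single token.
function merge_inline_html(array $tokens)
{
    $merged = array();
    foreach ($tokens as $token) {
        $last = count($merged) - 1;
        if (is_array($token) && $token[0] === T_INLINE_HTML
            && $last >= 0
            && is_array($merged[$last])
            && $merged[$last][0] === T_INLINE_HTML
        ) {
            // Append the text to the previous inline-HTML token. The line
            // number (index 2) stays that of the first fragment.
            $merged[$last][1] .= $token[1];
            continue;
        }
        $merged[] = $token;
    }
    return $merged;
}

// Simulated 5.2-style split, using the fragments from the report.
$split = array(
    array(T_INLINE_HTML, '<h6>Hello</h6>', 1),
    array(T_INLINE_HTML, '<s', 1),
    array(T_INLINE_HTML, 'pan class="test">Hello!</span>', 1),
);
print_r(merge_inline_html($split));
```

In real use the input would be the tokenizer output itself, e.g. merge_inline_html(token_get_all($source)). On 5.3 and later the function should simply return the list unchanged, since there are no adjacent inline-HTML tokens left to merge.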
 