Bug #6841 split, explode, strtok not parsing correctly
Submitted: 2000-09-21 22:55 UTC Modified: 2000-11-29 06:12 UTC
From: swenson at heronetwork dot com Assigned:
Status: Closed Package: Regexps related
PHP Version: 4.0.3pl1 OS: SuSE
Private report: No CVE-ID: None
 [2000-09-21 22:55 UTC] swenson at heronetwork dot com
On the current build I am trying to parse an external tab- and
space-separated file (mime types). None of the tokenizing functions appears to perform the regex parsing correctly, and all of them behave very oddly. Most interesting is that the first item is not even getting into the arrays (or token lists).

Example input:
# Simple mime file
text/html		htm HTM html HTML shtml SHTML
text/plain		java JAVA c C cc CC cpp CPP h H txt TXT
text/rtf		rtf RTF

Script:
<?php
$mimeFile="/tmp/mime.txt";

function get_mime_type ( $ext ) {
    global $mimeFile;
    if( !$ext || strlen(trim($ext)) == 0 ) {
        return "application/binary";
    }
    $fp = fopen ($mimeFile, "r");
    while(!feof($fp)) {
        $next = fgetss($fp, 300);
        $next = trim($next);
        echo("<p>Line in: $next<br>");
        if( !$next ) continue;
        // try it with [\t\s] or [:space:] or "  " etc.
        // it just has problems, same with split and strtok
        $mime = explode("[       ]",$next);
        if( substr($mime[0], 0, 1) == "#" ) continue;
        $len = count($mime);
        echo("Line has: $len tokens - ");
        for( $x = 1 ; $x <= $len ; $x++ ) {
            if( !$mime[$x] ) continue;
            if( substr($mime[$x] ,0 , 1) == "#" ) break;
            echo("$x - $mime[$x], ");
            if( $ext == $mime[$x] ) return $mime[0];
        }
    }
    return "application/binary";
}
echo("<br><H1>Mime type for .H is " . get_mime_type(".H") . "</H1>");
?>

Sample output:
Line in: # Simple mime file
Line in: 
Line in: text/html htm HTM html HTML shtml SHTML
Line has: 6 tokens - 1 - HTM, 2 - html, 3 - HTML, 4 - shtml, 5 - SHTML, 
Line in: text/plain java JAVA c C cc CC cpp CPP h H txt TXT
Line has: 12 tokens - 1 - JAVA, 2 - c, 3 - C, 4 - cc, 5 - CC, 6 - cpp, 7 - CPP, 8 - h, 9 - H, 10 - txt, 11 - TXT, 
Line in: text/rtf rtf RTF
Line has: 2 tokens - 1 - RTF, 
Line in: 
Mime type for .H is application/binary

 [2000-10-18 06:09 UTC] stas@php.net
Could you please provide a short example of any of these functions working incorrectly, as a short isolated script rather than as part of complicated code?
 [2000-10-18 14:27 UTC] swenson at heronetwork dot com
I already simplified it down and added a bunch of echoes so you can see the walk-through.

1) Place the sample input into a file called 'mime.txt'.
2) Place the php script into a file in the same directory.
3) Open the example script. I put a simple call in it already.

The events are pretty straightforward:
1) Open a file
2) Read a line into an array of tokens
3) Skip the first token (it is the mime type)
4) Check to see if the passed-in arg matches any of the other tokens
5) Return the first token from a line with a matching string or default to "application/binary"
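
The five steps above can be sketched roughly as follows. This is only an illustration, not the reporter's function: `lookup_mime_type` is a hypothetical name, it takes the file's lines as an array so it runs without /tmp/mime.txt, it assumes a current PHP with preg_split(), and it leaves out the original's handling of a trailing "#" comment on a line.

```php
<?php
// Hypothetical helper (not from the report): look up the mime type for an extension.
function lookup_mime_type(array $lines, string $ext): string
{
    foreach ($lines as $line) {
        $line = trim($line);
        if ($line === '' || $line[0] === '#') {
            continue; // skip blank lines and comment lines
        }
        // Split on runs of tabs and spaces; "+" keeps every delimiter non-empty.
        $tokens = preg_split('/[ \t]+/', $line);
        // The first token is the mime type, the rest are extensions.
        $type = array_shift($tokens);
        if (in_array($ext, $tokens, true)) {
            return $type;
        }
    }
    return 'application/binary';
}

$lines = [
    "# Simple mime file",
    "text/html\t\thtm HTM html HTML shtml SHTML",
    "text/plain\t\tjava JAVA c C cc CC cpp CPP h H txt TXT",
    "text/rtf\t\trtf RTF",
];
echo lookup_mime_type($lines, 'H'), "\n";   // text/plain
echo lookup_mime_type($lines, 'zip'), "\n"; // application/binary
```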

The extra comments in the script are for other regular expressions that SHOULD work but do not.
 [2000-10-18 14:47 UTC] swenson at heronetwork dot com
Sorry I spotted a bug in the sample call. It should be:

echo("<br><H1>Mime type for .H is " . get_mime_type("H") . "</H1>");

The real problem here is that if you look at one sample input:

"text/html               htm HTM html HTML shtml SHTML"

Which in regular expression land could look like this:
text/html\t\shtm\sHTM\shtml\sHTML\sshtml\sSHTML\n

when parsed with explode("[\t\s]") or the same without the regex escapes (as in the current example script) an array of 7 elements should be created:
0=>"text/html"
1=> "htm"
2=> "HTM"
3=> "html"
4=> "HTML"
5=> "shtml"
6=> "SHTML"

But instead (as shown in the sample output) an array of 6 elements is being created:
0=> "htm"
1=> "HTM"
2=> "html"
3=> "HTML"
4=> "shtml"
5=> "SHTML"

Additionally, ALL of the regular expression character classes ([:space:]) and escapes ('\t\n\s') are not functioning correctly. I have tried rebuilding PHP with Perl regex, system regex and PHP regex support, and none of them worked.

Having written a few regex engines myself I know how hard it is to get it right. But this is really a huge bug in explode and strtok.
 [2000-10-19 06:12 UTC] stas@php.net
I'm a little confused by your examples. First, explode does not accept a regular expression argument; it requires a string separator. Second, split, which does expect a regular expression, does not know the \s escape. That is a Perl escape, so you should use preg_split for it or use the [:space:] character class.

When I tried split("[\t ]",...) or preg_split("/[\t\s]/", ...) it worked flawlessly for me.
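
The difference between a literal separator and a pattern can be seen in a small sketch. It assumes a current PHP, where split() no longer exists (it was removed in PHP 7.0), so only explode() and preg_split() are shown; the sample line is taken from the report, with the two tabs written out as escapes.

```php
<?php
// Sample line from the report: two tabs after the type, then space-separated extensions.
$line = "text/html\t\thtm HTM html HTML shtml SHTML";

// explode() treats its first argument as a literal string, not a pattern.
// The four-character string "[<tab><space>]" never occurs in $line, so nothing is split.
$literal = explode("[\t ]", $line);

// preg_split() takes a PCRE pattern; "+" collapses the run of tabs into one delimiter.
$tokens = preg_split('/[\t ]+/', $line);

var_dump(count($literal)); // int(1)
var_dump($tokens[0]);      // string(9) "text/html"
var_dump(count($tokens));  // int(7)
```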

Please provide an example of non-working code along the following lines:

===example start
<?
$str = "MY String";
$array = split("pattern",$str);
var_dump($array);
?>

Should be: [0]=>"MY",[1]=>"string", but prints [0]=>"Y s",[1]=>"tring"
===example end
 [2000-10-19 14:12 UTC] swenson at heronetwork dot com
OK, sorry for giving too large a sample program.

Try this:
<?
$str = "   MY    String           is                            hosed    ";
$array = split("[ \t]*",$str);
$len = count($array);
echo("String has: $len tokens\n");
for( $x = 0 ; $x < $len ; $x++ ) {
    print " \$array[$x] => \"$array[$x]\"\n";
}
?>

It returns this:
Warning:  bad regular expression for split() in split.php on line 3
String has: 1 tokens
 $array[0] => ""

When it should return:
String has: 4 tokens
 $array[0] => "MY"
 $array[1] => "String"
 $array[2] => "is"
 $array[3] => "hosed"

Replacing split("[ \t]*",$str); with split("[[:space:]]*",$str); returns the same error.
 [2000-10-19 15:38 UTC] joey@php.net
There *does* appear to be a bug here, but it is not what
you seem to think it is. Using your most recent example,
you should have:

String has: 6 tokens
 $array[0] => ""
 $array[1] => "MY"
 $array[2] => "String"
 $array[3] => "is"
 $array[4] => "hosed"
 $array[5] => ""

That's what you asked for.
The bug has something to do with *. Replacing the * in your
example with + gets rid of the error, but is obviously
not quite the same.
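
The six-token result can be reproduced with a short sketch. It uses preg_split() as a stand-in for split(), on the assumption of a current PHP where split() is no longer available; the leading and trailing runs of spaces each produce one empty string.

```php
<?php
$str = "   MY    String           is                            hosed    ";

// With "+", the leading spaces form a delimiter at the very start and the
// trailing spaces form one at the very end, so an empty string appears on each side.
$array = preg_split('/[ \t]+/', $str);

echo "String has: ", count($array), " tokens\n"; // String has: 6 tokens
foreach ($array as $x => $token) {
    echo " \$array[$x] => \"$token\"\n";
}
```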
 [2000-11-29 06:12 UTC] stas@php.net
OK, the problem is that your regular expression allows null
matches (i.e., [ \t]* matches the empty string). That generally
means that the delimiter can be empty, which is wrong: a
delimiter can never be empty, because that is meaningless. So
split is actually right to say your regexp is bad. Use + instead of *.
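
A minimal sketch of the null-match point, assuming a current PHP and using the preg_* functions in place of split(). It first shows that [ \t]* succeeds while consuming nothing, then one way to get the four words the reporter expected. The PREG_SPLIT_NO_EMPTY flag is an addition for illustration; it is not something suggested in the thread.

```php
<?php
$str = "   MY    String           is                            hosed    ";

// "*" lets the pattern succeed without consuming anything:
// at offset 0 of "MY" there is no whitespace, yet the match succeeds.
preg_match('/[ \t]*/', "MY", $m);
var_dump($m[0]); // string(0) ""

// "+" needs at least one whitespace character, so every delimiter is non-empty.
// PREG_SPLIT_NO_EMPTY drops the empty pieces caused by leading and trailing blanks.
$words = preg_split('/[ \t]+/', $str, -1, PREG_SPLIT_NO_EMPTY);
print_r($words); // MY, String, is, hosed
```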
 