Request #55261 Tokenize multiple strings in parallel is impossible using strtok
Submitted: 2011-07-21 14:08 UTC Modified: 2021-03-04 14:01 UTC
Votes:1
Avg. Score:3.0 ± 0.0
Reproduced:1 of 1 (100.0%)
Same Version:1 (100.0%)
Same OS:1 (100.0%)
From: antonio dot bonifati at gmail dot com Assigned: cmb (profile)
Status: Wont fix Package: Strings related
PHP Version: Irrelevant OS:
Private report: No CVE-ID: None
 [2011-07-21 14:08 UTC] antonio dot bonifati at gmail dot com
Description:
------------
Hi there,
I have some very long strings (they come from a MySQL query selecting multiple 
GROUP_CONCAT-enated fields) that I would like to tokenize quickly using the 
strtok function, but I cannot do that in parallel, as I need to. That is, I need 
to fetch the first token from all strings, then the second token, and so on.

E.g. given:

$a = '1,2,3,4';
$b = 'a,b,c,d';

strtok($a, ',') returns the first token of $a. If I then call strtok($b, ','), 
it returns the first token of $b, so I have '1' and 'a' together for the 
first iteration. In the next iteration I would need '2' and 'b', and so on, but 
strtok(',') will only give me 'b'. There is no way to fetch '2' from $a, since 
strtok can only be used on one string at a time.
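
The behaviour described above can be reproduced with a short script. This is only an illustration using the two example strings from this report; the variable names $first_a, $first_b and $next are made up for the demonstration.

```php
<?php
$a = '1,2,3,4';
$b = 'a,b,c,d';

// Starting a scan of $a sets strtok's single internal state.
$first_a = strtok($a, ',');   // '1'

// Starting a scan of $b replaces that state; the position in $a is lost.
$first_b = strtok($b, ',');   // 'a'

// A continuation call has no way to say "continue with $a".
$next = strtok(',');          // 'b', not '2'
```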

I know I could implement my own algorithm that accesses the strings 
character by character, but that would be slow. I have tried using 
explode(',', $a) and explode(',', $b), but that uses too much memory: my strings 
are very long and numerous, so it generates very big intermediate arrays.

Would it be possible to have a tokenizer facility that does not 
store the string to tokenize, but only the index of the last processed 
character, so that it can be used on multiple strings?

I know that maybe I shouldn't use PHP for such tasks but rather code in C, but 
PHP is so quick to develop with that I cannot resist :)

I understand that strtok has been engineered this way to avoid passing the 
string over and over, which would be slow for long strings due to pass by value. 
Therefore I would like to have another tokenizer function that always takes two 
arguments, one of them a reference to the string.
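
One way to approximate the requested facility in userland is a generator that keeps one scan position per string, so several strings can be advanced in lockstep without building arrays. The sketch below is an assumption-laden illustration, not an existing PHP function: tokens() is a hypothetical name, it treats $delim as one non-empty literal delimiter (strtok treats it as a set of characters), and unlike strtok it yields empty tokens between consecutive delimiters. As far as I know, PHP strings are copy-on-write, so passing $s by value should not copy it.

```php
<?php
// Hypothetical helper, not a PHP built-in: lazily yields the tokens of $s.
// Assumes $delim is a non-empty literal delimiter string.
function tokens(string $s, string $delim): Generator
{
    $offset = 0;
    $length = strlen($s);
    while ($offset <= $length) {
        $end = strpos($s, $delim, $offset);
        if ($end === false) {
            // No further delimiter: the rest of the string is the last token.
            yield substr($s, $offset);
            return;
        }
        yield substr($s, $offset, $end - $offset);
        $offset = $end + strlen($delim);
    }
}

$a = '1,2,3,4';
$b = 'a,b,c,d';

// Each generator holds its own position, so the two scans do not interfere.
$ta = tokens($a, ',');
$tb = tokens($b, ',');

$pairs = [];
while ($ta->valid() && $tb->valid()) {
    $pairs[] = [$ta->current(), $tb->current()];
    $ta->next();
    $tb->next();
}
```

Here $pairs is collected only to make the result visible; in a memory-sensitive loop each pair would be processed as it is produced.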


History

 [2018-09-30 18:16 UTC] cmb@php.net
-Package: Unknown/Other Function +Package: Strings related
 [2018-09-30 18:16 UTC] cmb@php.net
FWIW, PHP's strtok() is modelled after C's strtok(); you're
looking for something like strtok_r().
 [2021-03-04 14:01 UTC] cmb@php.net
-Status: Open +Status: Wont fix -Assigned To: +Assigned To: cmb
 [2021-03-04 14:01 UTC] cmb@php.net
Oh, I misread the request back then; strtok_r() would help with
this.  Anyhow, no further comments or upvotes for almost ten
years, so I assume there is not much interest in this feature.

However, I think you can implement a solution in userland by using
strpos() with its $offset parameter and substr().
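
A minimal sketch of that userland approach, assuming a single non-empty literal delimiter: the caller keeps one integer offset per string and passes it by reference. next_token() is a hypothetical name, not part of PHP, and unlike strtok() it returns empty tokens between consecutive delimiters.

```php
<?php
/**
 * Hypothetical helper, not part of PHP.
 * Returns the token of $s that starts at $offset and advances $offset past
 * the following delimiter. Returns false once the string is exhausted.
 * Assumes $delim is a non-empty literal delimiter string.
 */
function next_token(string $s, string $delim, int &$offset)
{
    if ($offset > strlen($s)) {
        return false; // input exhausted
    }
    $end = strpos($s, $delim, $offset);
    if ($end === false) {
        // Last token: everything from $offset to the end of the string.
        $token = substr($s, $offset);
        $offset = strlen($s) + 1; // mark the string as exhausted
        return $token;
    }
    $token = substr($s, $offset, $end - $offset);
    $offset = $end + strlen($delim);
    return $token;
}

$a = '1,2,3,4';
$b = 'a,b,c,d';
$i = 0; // scan position in $a
$j = 0; // scan position in $b

$pairs = [];
while (($x = next_token($a, ',', $i)) !== false
    && ($y = next_token($b, ',', $j)) !== false) {
    $pairs[] = [$x, $y];
}
```

The only per-string state is the integer offset, which is what the original request asked for; the strict !== false comparison matters because a token such as '0' is falsy in PHP.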
 