Request #55261 Tokenize multiple strings in parallel is impossible using strtok
Submitted: 2011-07-21 14:08 UTC Modified: 2021-03-04 14:01 UTC
Votes:1
Avg. Score:3.0 ± 0.0
Reproduced:1 of 1 (100.0%)
Same Version:1 (100.0%)
Same OS:1 (100.0%)
From: antonio dot bonifati at gmail dot com
Assigned: cmb
Status: Wont fix
Package: Strings related
PHP Version: Irrelevant
OS:
Private report: No
CVE-ID: None
 [2011-07-21 14:08 UTC] antonio dot bonifati at gmail dot com
Description:
------------
Hi there,
I have some very long strings (they come from a MySQL query selecting multiple 
GROUP_CONCAT-enated fields) that I would like to tokenize quickly using the 
strtok function, but I cannot do that in parallel as I need to, that is, 
fetching the first token from all strings, then the second token, etc.

E.g. given:

$a = '1,2,3,4';
$b = 'a,b,c,d';

strtok($a, ',') returns the first token of $a. If I then call strtok($b, ','), 
it returns the first token of $b, so I have '1' and 'a' together for the 
first iteration. In the next iteration I would need '2' and 'b', etc., but 
strtok(',') will only give me 'b'. There is no way to fetch '2' from $a, since 
strtok can only be used on one string at a time.
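
The behaviour described above can be reproduced with a few lines. This is only a minimal sketch of the problem; the variable names $first_a, $first_b and $next are made up for illustration.

```php
<?php
$a = '1,2,3,4';
$b = 'a,b,c,d';

// Passing a string starts a new tokenization and replaces strtok's
// single internal state.
$first_a = strtok($a, ','); // '1'
$first_b = strtok($b, ','); // 'a'; the position inside $a is now lost

// Without a string argument, strtok continues the most recent string ($b).
$next = strtok(',');        // 'b', not '2'
```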

I know I could implement my own algorithm that accesses the strings 
character by character, but that would be slow. I have tried using 
explode(',', $a) and explode(',', $b), but that uses too much memory: it 
generates very big intermediate arrays, and my strings are very long and numerous.

Would it be possible to have a tokenizer facility that does not 
store the string to tokenize, but only the index of the last processed 
character, so that it can be used on multiple strings?
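
One way to approximate such a facility in userland is a generator per string, where each generator keeps its own offset. The sketch below is an illustration only: tokens() is a hypothetical helper, not an existing PHP function, and unlike strtok it treats $delim as a single delimiter string (assumed non-empty) and does not skip empty tokens. SPL's MultipleIterator is used to advance the generators in lockstep.

```php
<?php
// Hypothetical helper: lazily yields the tokens of $s one at a time,
// so no intermediate array of all tokens is built.
function tokens(string $s, string $delim): Generator
{
    $offset = 0; // index of the first character of the next token
    while (($end = strpos($s, $delim, $offset)) !== false) {
        yield substr($s, $offset, $end - $offset);
        $offset = $end + strlen($delim);
    }
    yield substr($s, $offset); // remainder after the last delimiter
}

$a = '1,2,3,4';
$b = 'a,b,c,d';

// By default MultipleIterator stops as soon as one iterator is exhausted.
$both = new MultipleIterator();
$both->attachIterator(tokens($a, ','));
$both->attachIterator(tokens($b, ','));

$rows = [];
foreach ($both as $pair) {
    $rows[] = $pair; // $pair holds the current token of each string
}
```

Each generator holds a handle to its string rather than a copy, so the extra memory per string should be limited to the offset and the token currently being yielded.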

I know that maybe I shouldn't use PHP for such tasks but rather code in C, but 
PHP is so quick to develop with that I cannot resist :)

I understand that strtok has been engineered this way to avoid passing the 
string over and over, which would be slow for long strings due to pass by value. 
Therefore I would like to have another tokenizer function that always takes 
two arguments and accepts the string by reference.


History

 [2018-09-30 18:16 UTC] cmb@php.net
-Package: Unknown/Other Function +Package: Strings related
 [2018-09-30 18:16 UTC] cmb@php.net
FWIW, PHP's strtok() is modelled after C's strtok(); you're
looking for something like strtok_r().
 [2021-03-04 14:01 UTC] cmb@php.net
-Status: Open +Status: Wont fix -Assigned To: +Assigned To: cmb
 [2021-03-04 14:01 UTC] cmb@php.net
Oh, I misread the request back then; strtok_r() would help with
this.  Anyhow, no further comments or upvotes for almost ten
years, so I assume there is not much interest in this feature.

However, I think you can implement a solution in userland by using
strpos() with its $offset parameter and substr().
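
A minimal sketch of that suggestion follows, under some assumptions. next_token() is a hypothetical helper name, not part of PHP; the caller keeps one integer offset per string, which plays the role of the save pointer in C's strtok_r(). Unlike strtok, this sketch treats $delim as one delimiter string (assumed non-empty) and returns empty tokens instead of skipping them.

```php
<?php
/**
 * Hypothetical helper: returns the token of $s that starts at $offset and
 * advances $offset past the following delimiter. Returns null once the
 * string is exhausted.
 */
function next_token(string $s, string $delim, int &$offset): ?string
{
    $len = strlen($s);
    if ($offset > $len) {
        return null; // already consumed the last token
    }
    $end = strpos($s, $delim, $offset);
    if ($end === false) {
        $token = substr($s, $offset);
        $offset = $len + 1; // mark the string as exhausted
        return $token;
    }
    $token = substr($s, $offset, $end - $offset);
    $offset = $end + strlen($delim);
    return $token;
}

$a = '1,2,3,4';
$b = 'a,b,c,d';
$ia = 0; // independent position in $a
$ib = 0; // independent position in $b

$pairs = [];
while (($x = next_token($a, ',', $ia)) !== null
    && ($y = next_token($b, ',', $ib)) !== null) {
    $pairs[] = [$x, $y]; // first ['1', 'a'], then ['2', 'b'], ...
}
```

Passing $s by value here should not copy the string on each call, since PHP strings are copy-on-write and the function never modifies $s.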
 
PHP Copyright © 2001-2025 The PHP Group
All rights reserved.
Last updated: Fri Jan 03 00:01:29 2025 UTC