Bug #52731 mb_strpos reports needle position incorrectly
Submitted: 2010-08-29 17:05 UTC Modified: 2021-10-11 16:33 UTC
Votes:3
Avg. Score:3.0 ± 0.8
Reproduced:2 of 2 (100.0%)
Same Version:0 (0.0%)
Same OS:0 (0.0%)
From: tokul at users dot sourceforge dot net Assigned:
Status: Open Package: mbstring related
PHP Version: 5.3SVN-2010-08-29 (snap) OS:
Private report: No CVE-ID: None
 [2010-08-29 17:05 UTC] tokul at users dot sourceforge dot net
Description:
------------
If the code specifies an incorrect character set (utf-8 instead of big5 in the test case), mb_strpos() can report the needle position incorrectly in some cases. It looks like $offset is calculated one way and the result is calculated some other way. See the test code. mb_substr($str,$pos1,1,'utf-8') can be used to see the character at the reported needle position.

I understand that $str is not in the UTF-8 charset, but the position reported by mb_strpos() violates very basic strpos() behavior: the search starts at the $offset position and the result position is counted from the start of the string. The result should never be lower than $offset; if no such match exists, it should be boolean false.
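The expectation described above can be written down as a small check. This is only a sketch: position_respects_offset() is a hypothetical helper name, not part of mbstring, and the sample string is deliberately valid UTF-8 so that character offsets are well defined.

```php
<?php
// Hypothetical helper (not an mbstring function): true when $pos is a value
// that a strpos-style search starting at $offset is allowed to return,
// i.e. boolean false or an integer that is not lower than $offset.
function position_respects_offset($pos, $offset)
{
    return $pos === false || (is_int($pos) && $pos >= $offset);
}

// Valid UTF-8 input, chosen for illustration only.
$valid = "a&b&c";
$first = mb_strpos($valid, '&', 0, 'UTF-8');
$second = mb_strpos($valid, '&', $first + 1, 'UTF-8');

var_dump(position_respects_offset($first, 0));
var_dump(position_respects_offset($second, $first + 1));
```

The same helper can be applied to the two values produced by the test script to see whether a given PHP build still breaks the invariant.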

php5.3-201008291230
compiled with ./configure --prefix=/somepath --enable-cli --disable-all --enable-mbstring

Also tested PHP 5.2.0 (debian etch), 5.3.2-2 (debian squeeze) and 5.2.13 (standard PHP package). 5.2.13 and 5.3.2 results are the same. 5.2.0 results are a little bit different, but I was able to reproduce position calculation problem with more complex code.


Test script:
---------------
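// Big5-encoded bytes, deliberately searched as if they were UTF-8.
$str = "\xb7\x51 &\xb4\xa6\xb6\x7d";
$pos1 = mb_strpos($str,'&',0,'utf-8');
var_dump($pos1);
// Search again, starting one character past the first match.
$pos2 = mb_strpos($str,'&',$pos1 + 1,'utf-8');
var_dump($pos2);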

Expected result:
----------------
The second var_dump() result should be higher than the first one, or it should be boolean false.

The result should not be lower than the offset.


Actual result:
--------------
int(2)
int(2)
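
For comparison, here is a sketch of the same search with the encoding declared as Big5, which is what the description says the bytes really are. It assumes 'BIG-5' is the label this mbstring build accepts for Big5. No output is shown, since it has not been checked against every PHP version.

```php
<?php
// Same bytes as in the test script above.
$str = "\xb7\x51 &\xb4\xa6\xb6\x7d";

// Assumption: 'BIG-5' is the encoding label accepted by this mbstring build.
$enc = 'BIG-5';

$pos1 = mb_strpos($str, '&', 0, $enc);
var_dump($pos1);

$pos2 = false;
if ($pos1 !== false) {
    // Show which character sits at the reported position.
    var_dump(mb_substr($str, $pos1, 1, $enc));

    // Continue the search one character past the first match.
    $pos2 = mb_strpos($str, '&', $pos1 + 1, $enc);
    var_dump($pos2);
}
```

With the matching encoding, the second result is expected to be either boolean false or higher than the first, which is the behavior the report asks for.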

History

 [2010-08-29 17:21 UTC] felipe@php.net
-Status: Open +Status: Assigned -Assigned To: +Assigned To: moriyoshi
 [2016-07-31 16:15 UTC] cmb@php.net
The $offset parameter of mb_strpos() is supposed to denote a
position in characters (actually, Unicode code points in this
case), not bytes. However, it's not possible to count
characters in invalid UTF-8, so the function would have to error,
but actually checking for valid UTF-8 would slow down this and
other related functions even for valid UTF-8. Not sure if that's
worth it.

Considering that several mbstring functions handle invalid UTF-8
badly[1], it might be best to leave it as is, and document that
these functions expect valid strings according to the chosen
encoding.

[1] E.g.

  mb_convert_encoding($str, 'ISO-8859-1', 'UTF-8')

silently returns

  string(8) "?Q &???}"
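
If valid input is treated as a precondition, as suggested above, callers can do the check themselves. The following is a minimal sketch of that idea: checked_mb_strpos() is a hypothetical wrapper, not an mbstring API, and throwing an exception is just one possible way to report the problem. It adds an extra validation pass over the string on every call, which is the cost mentioned above.

```php
<?php
// Hypothetical caller-side wrapper (not part of mbstring): validate both
// strings against the declared encoding first, then run the search.
function checked_mb_strpos($haystack, $needle, $offset, $encoding)
{
    if (!mb_check_encoding($haystack, $encoding)
        || !mb_check_encoding($needle, $encoding)) {
        throw new InvalidArgumentException(
            "haystack or needle is not valid $encoding"
        );
    }
    return mb_strpos($haystack, $needle, $offset, $encoding);
}

// The string from this report is not valid UTF-8, so it is rejected
// before any position is calculated.
try {
    checked_mb_strpos("\xb7\x51 &\xb4\xa6\xb6\x7d", '&', 0, 'UTF-8');
    echo "searched\n";
} catch (InvalidArgumentException $e) {
    echo "rejected: ", $e->getMessage(), "\n";
}
```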
 [2017-10-24 06:33 UTC] kalle@php.net
-Status: Assigned +Status: Open -Assigned To: moriyoshi +Assigned To:
 [2021-10-11 16:33 UTC] cmb@php.net
For reference: <https://3v4l.org/5650p>.
 