Bug #52731 mb_strpos reports needle position incorrectly
Submitted: 2010-08-29 17:05 UTC
Modified: 2021-10-11 16:33 UTC
Votes:3
Avg. Score:3.0 ± 0.8
Reproduced:2 of 2 (100.0%)
Same Version:0 (0.0%)
Same OS:0 (0.0%)
From: tokul at users dot sourceforge dot net
Assigned:
Status: Open
Package: mbstring related
PHP Version: 5.3SVN-2010-08-29 (snap)
OS:
Private report: No
CVE-ID: None

 [2010-08-29 17:05 UTC] tokul at users dot sourceforge dot net
Description:
------------
If code specifies an incorrect character set (utf-8 instead of big5 in the test case), mb_strpos() can report the needle position incorrectly in some cases. It looks like $offset is calculated one way and the result is calculated another way. See the test code. mb_substr($str,$pos1,1,'utf-8') can be used to see the character at the reported needle position.

I understand that $str is not in the UTF-8 charset, but the position reported by mb_strpos() violates very basic strpos behavior: the search starts from the $offset position, and the result position is counted from the start of the string. The result should not be lower than $offset, or it should be boolean false.
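
The expectation above can be written down as a small check. This is only a sketch: position_respects_offset() is a hypothetical helper, not part of PHP or mbstring, and the sample values are the ones this report lists under "Actual result".

```php
<?php
// Hypothetical helper, not part of PHP: checks the strpos-style contract
// described above for a search that started at $offset.
function position_respects_offset($pos, int $offset): bool
{
    // Either nothing was found, or the match lies at or after the offset.
    return $pos === false || (is_int($pos) && $pos >= $offset);
}

// First call in the report: match at 2, searched from offset 0.
var_dump(position_respects_offset(2, 0));     // bool(true)
// Second call in the report: match at 2, searched from offset 3.
var_dump(position_respects_offset(2, 3));     // bool(false)
// What the report expects from the second call instead.
var_dump(position_respects_offset(false, 3)); // bool(true)
```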

php5.3-201008291230
compiled with ./configure --prefix=/somepath --enable-cli --disable-all --enable-mbstring

Also tested PHP 5.2.0 (debian etch), 5.3.2-2 (debian squeeze) and 5.2.13 (standard PHP package). The 5.2.13 and 5.3.2 results are the same. The 5.2.0 results are slightly different, but I was able to reproduce the position calculation problem with more complex code.


Test script:
---------------
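<?php
// Big5-encoded bytes, deliberately searched as if they were UTF-8.
$str = "\xb7\x51 &\xb4\xa6\xb6\x7d";
$pos1 = mb_strpos($str, '&', 0, 'utf-8');
var_dump($pos1);
// Search again, starting one character past the first match.
$pos2 = mb_strpos($str, '&', $pos1 + 1, 'utf-8');
var_dump($pos2);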

Expected result:
----------------
The second var_dump() result should be higher than the first one, or it should be boolean false.

The result should not be lower than the offset.


Actual result:
--------------
int(2)
int(2)
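
For comparison, here is a sketch of the same two searches with the encoding declared as Big5, which the description implies is the string's real encoding. The encoding name 'BIG-5' is the spelling mbstring lists among its supported encodings; the variable names are illustrative only.

```php
<?php
// Same bytes as the test script, but declared as Big5 rather than UTF-8.
$str = "\xb7\x51 &\xb4\xa6\xb6\x7d";

// First search from the start of the string.
$first = mb_strpos($str, '&', 0, 'BIG-5');

// Second search, starting one character past the first match.
$second = mb_strpos($str, '&', $first + 1, 'BIG-5');

var_dump($first, $second);
```

The string contains only one '&', so with a matching encoding the second search should have nothing left to find, which is the behavior the report expects.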

History

 [2010-08-29 17:21 UTC] felipe@php.net
-Status: Open
+Status: Assigned
-Assigned To:
+Assigned To: moriyoshi
 [2016-07-31 16:15 UTC] cmb@php.net
The $offset parameter of mb_strpos() is supposed to denote a
position in characters (actually, Unicode code points in this
case), not bytes. However, it's not possible to count
characters in invalid UTF-8, so the function would have to raise
an error. But actually checking for valid UTF-8 would slow down
this and other related functions even for valid UTF-8. I'm not
sure that's worth it.

Considering that several mbstring functions handle invalid UTF-8
badly[1], it might be best to leave it as is, and document that
these functions expect valid strings according to the chosen
encoding.

[1] E.g.

  mb_convert_encoding($str, 'ISO-8859-1', 'UTF-8')

silently returns

  string(8) "?Q &???}"
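
Until the expectation of valid input is documented or enforced, callers can validate the string themselves. A minimal sketch, assuming the caller accepts the cost of the extra scan: checked_mb_strpos() is a hypothetical wrapper and not part of mbstring, while mb_check_encoding() and mb_strpos() are the existing mbstring functions.

```php
<?php
// Hypothetical wrapper, not part of mbstring: validate first, then search.
function checked_mb_strpos(string $haystack, string $needle, int $offset, string $encoding)
{
    // mb_check_encoding() scans the whole haystack, which is the extra
    // cost discussed in the comment above.
    if (!mb_check_encoding($haystack, $encoding)) {
        throw new InvalidArgumentException("haystack is not valid $encoding");
    }
    return mb_strpos($haystack, $needle, $offset, $encoding);
}

try {
    // The Big5 bytes from the test script, wrongly declared as UTF-8.
    checked_mb_strpos("\xb7\x51 &\xb4\xa6\xb6\x7d", '&', 0, 'UTF-8');
} catch (InvalidArgumentException $e) {
    echo $e->getMessage(), "\n";
}
```

Throwing instead of returning false keeps "invalid input" distinguishable from "needle not found"; that is a design choice of this sketch, not something the report prescribes.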
 [2017-10-24 06:33 UTC] kalle@php.net
-Status: Assigned
+Status: Open
-Assigned To: moriyoshi
+Assigned To:
 [2021-10-11 16:33 UTC] cmb@php.net
For reference: <https://3v4l.org/5650p>.
 