Bug #52731 mb_strpos reports needle position incorrectly
Submitted: 2010-08-29 17:05 UTC Modified: 2016-07-31 16:15 UTC
From: tokul at users dot sourceforge dot net Assigned: moriyoshi
Status: Assigned Package: mbstring related
PHP Version: 5.3SVN-2010-08-29 (snap) OS:
Private report: No CVE-ID:
 [2010-08-29 17:05 UTC] tokul at users dot sourceforge dot net
If code sets an incorrect character set (utf-8 instead of big5 in the test case), mb_strpos() can report the needle position incorrectly in some cases. It looks like $offset is calculated one way and the result is calculated another way. See the test code. mb_substr($str,$pos1,1,'utf-8') can be used to see the character at the reported needle position.

I understand that $str is not in the UTF-8 charset, but the position reported by mb_strpos() violates very basic strpos function behavior. The search starts after the $offset position and the result position is counted from the start of the string. The result should not be lower than $offset, or it should be boolean false.

Compiled with ./configure --prefix=/somepath --enable-cli --disable-all --enable-mbstring

Also tested PHP 5.2.0 (debian etch), 5.3.2-2 (debian squeeze) and 5.2.13 (standard PHP package). 5.2.13 and 5.3.2 results are the same. 5.2.0 results are a little bit different, but I was able to reproduce position calculation problem with more complex code.

Test script:
<?php
$str = "\xb7\x51 &\xb4\xa6\xb6\x7d";
$pos1 = mb_strpos($str,'&',0,'utf-8');
var_dump($pos1);
$pos2 = mb_strpos($str,'&',$pos1 + 1,'utf-8');
var_dump($pos2);

Expected result:
The second var_dump() result should be higher than the first one, or it should be boolean false.

The result should not be lower than the offset.
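
The expectation above can be written down as a small check. This is only a sketch: position_respects_offset() is a hypothetical helper name, not part of mbstring, and the sample string is valid UTF-8 chosen for illustration, so character positions are well defined.

```php
<?php
// Hypothetical helper (not part of mbstring): checks the strpos-style
// contract that a result is either false or not lower than the offset.
function position_respects_offset($pos, int $offset): bool
{
    return $pos === false || $pos >= $offset;
}

// Valid UTF-8 sample, so counting characters is unambiguous.
$sample = "a&b&c";
$first  = mb_strpos($sample, '&', 0, 'UTF-8');           // first '&'
$second = mb_strpos($sample, '&', $first + 1, 'UTF-8');  // search resumes after it

var_dump(position_respects_offset($first, 0));
var_dump(position_respects_offset($second, $first + 1));
```

Running the same check against the results of the test script would show whether the reported position falls below the offset.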

Actual result:


 [2010-08-29 17:21 UTC]
 [2016-07-31 16:15 UTC]
The $offset parameter of mb_strpos() is supposed to denote a
position in characters (actually, Unicode code points in this
case), not bytes. However, it's not possible to count
characters in invalid UTF-8, so the function would have to error,
but actually checking for valid UTF-8 would slow down this and
other related functions even for valid UTF-8. Not sure if that's
worth it.
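
For callers who want the check anyway, a userland sketch could look like the following. checked_mb_strpos() is a hypothetical wrapper, not an mbstring function, and returning false for invalid input is an assumption made here for brevity. It pays for an extra validation pass over both strings on every call, which is the cost mentioned above.

```php
<?php
// Hypothetical wrapper (an assumption, not an mbstring function): refuses
// to search when haystack or needle is not valid in the given encoding.
function checked_mb_strpos(string $haystack, string $needle, int $offset = 0, string $encoding = 'UTF-8')
{
    if (!mb_check_encoding($haystack, $encoding) || !mb_check_encoding($needle, $encoding)) {
        // Character positions are undefined for invalid input.
        return false;
    }
    return mb_strpos($haystack, $needle, $offset, $encoding);
}

// Big5 bytes from the report; they are not valid UTF-8.
$big5 = "\xb7\x51 &\xb4\xa6\xb6\x7d";
var_dump(checked_mb_strpos($big5, '&', 0, 'UTF-8'));

// Valid UTF-8 input is passed through to mb_strpos() unchanged.
var_dump(checked_mb_strpos("a&b&c", '&', 2, 'UTF-8'));
```

Note that returning false makes "invalid input" indistinguishable from "needle not found"; throwing an exception instead would keep the two cases apart.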

Considering that several mbstring functions handle invalid UTF-8
badly[1], it might be best to leave it as is, and document that
these functions expect valid strings according to the chosen
encoding.

[1] E.g.

  mb_convert_encoding($str, 'ISO-8859-1', 'UTF-8')

silently returns

  string(8) "?Q &???}"
