PHP :: Bug #69433 :: function mb_split incorrectly parsing multibyte strings with mbregex

Bug #69433: function mb_split incorrectly parsing multibyte strings with mbregex
Submitted: 2015-04-12 22:33 UTC
Modified: 2016-08-13 10:58 UTC
From: pffycloud at gmail dot com
Assigned:
Status: Verified
Package: mbstring related
PHP Version: 5.6.7
OS: any
Private report: No
CVE-ID: None


[2015-04-12 22:33 UTC] pffycloud at gmail dot com

Description:
------------
The multibyte split function mb_split() does not work as expected with the '\B' pattern.

Instead of splitting the multibyte string into one array element per character, mb_split() either fails with a warning or splits in the middle of multibyte characters, producing elements that contain broken byte sequences.

The test script below produces the expected result for the first multibyte string, but unexpected results for the next two.

Thank you for reviewing this potential bug.

Test script:
---------------
<?php
header('Content-Type: text/html; charset=UTF-8');
mb_regex_encoding('UTF-8');
mb_internal_encoding('UTF-8');

$arr = mb_split('\B', "你好"); # Array ( [0] => 你 [1] => 好 ) ## Okay!
print_r($arr);

$arr = mb_split('\B', "你你"); # Array ( [0] => 你 [1] => 你 ) ## Expected Result
print_r($arr); ## Instead, this message appears:
## Warning: mb_split(): mbregex search failure in mbsplit():
## no support in this configuration in /dir/foo.php on line 22

$arr = mb_split('\B', '隨著劇情的推進');
print_r($arr);
# Expected Result
# Array ( [0] => 隨 [1] => 著 [2] => 劇 [3] => 情 [4] => 的 [5] => 推 [6] => 進 )

# Actual Result, NOT expected
# Array ( [0] => 隨 [1] => � [2] => �劇 [3] => � [4] => �的 [5] => � [6] => �進 )

History


[2015-04-12 22:34 UTC] pffycloud at gmail dot com

-Operating System: Mac OS X
+Operating System: Mac OS X 10.10 Yosemite

[2015-04-12 22:34 UTC] pffycloud at gmail dot com

Added Mac OS X 10.10 Yosemite to 'OS' for clarity.

[2015-04-14 20:19 UTC] yohgaki@php.net

-Package: *General Issues
+Package: mbstring related
-Operating System: Mac OS X 10.10 Yosemite
+Operating System: any

[2015-04-14 20:19 UTC] yohgaki@php.net

It seems '\B' behaves differently than PCRE.

http://3v4l.org/fb4XN

[2016-07-31 16:30 UTC] cmb@php.net

-Status: Open
+Status: Verified

[2016-07-31 16:30 UTC] cmb@php.net

> It seems '\B' behaves differently than PCRE.

Indeed, and I'd call that difference a bug, see
<https://3v4l.org/U3tdm>.

[2016-08-13 10:58 UTC] cmb@php.net

Basically, the problem is that neither mb_split() nor
mb_ereg_replace() properly cater to matches of length 0, so \B is
matched at the very start of the subject and also 1 position
ahead.
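
Since the failure is tied to zero-length matches in mbregex, one possible workaround for the per-character split attempted in the test script is to avoid mbregex altogether and split with PCRE. The following is only a sketch under that assumption: split_utf8_chars() is a hypothetical helper name, not part of PHP or of this report, and it relies on the documented behavior of preg_split() with the /u modifier and PREG_SPLIT_NO_EMPTY.

```php
<?php
// Hypothetical helper (illustration only): split a UTF-8 string into
// characters using PCRE instead of mbregex.
function split_utf8_chars(string $s): array
{
    // An empty pattern with the /u modifier matches between every character;
    // PREG_SPLIT_NO_EMPTY drops the empty pieces at the start and end.
    $parts = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on failure, e.g. for invalid UTF-8 input.
    return $parts === false ? [] : $parts;
}

print_r(split_utf8_chars('你你'));
print_r(split_utf8_chars('隨著劇情的推進'));
```

This does not fix mb_split() itself; it only sidesteps the zero-length-match handling described above. On PHP 7.4 and later, mb_str_split() serves the same purpose.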

[2016-08-20 10:49 UTC] cmb@php.net

Related To: Bug #69256


