Bug #69433 function mb_split incorrectly parsing multibyte strings with mbregex
Submitted: 2015-04-12 22:33 UTC
Modified: 2016-08-13 10:58 UTC
From: pffycloud at gmail dot com
Assigned:
Status: Verified
Package: mbstring related
PHP Version: 5.6.7
OS: any
Private report: No
CVE-ID: None
 [2015-04-12 22:33 UTC] pffycloud at gmail dot com
Description:
------------
The multibyte split function mb_split() does not appear to work as expected.

Instead of splitting the multibyte string into one array element per character, it splits the string at unpredictable positions, sometimes in the middle of a character, yielding broken array elements.

The test script below produces the expected result for the first multibyte string, but unexpected results for the next two.

Thank you for reviewing this potential bug.

Test script:
---------------
<?php
header('Content-Type: text/html; charset=UTF-8');
mb_regex_encoding('UTF-8');
mb_internal_encoding('UTF-8');

$arr = mb_split('\B', "你好"); # Array ( [0] => 你 [1] => 好 ) ## Okay!
print_r($arr);

$arr = mb_split('\B', "你你"); # Array ( [0] => 你 [1] => 你 ) ## Expected Result
print_r($arr); ## Instead, this message appears:
## Warning: mb_split(): mbregex search failure in mbsplit():
## no support in this configuration in /dir/foo.php on line 22

$arr = mb_split('\B', '隨著劇情的推進');
print_r($arr);
# Expected Result
# Array ( [0] => 隨 [1] => 著 [2] => 劇 [3] => 情 [4] => 的 [5] => 推 [6] => 進 )

# Actual Result, NOT expected
# Array ( [0] => 隨 [1] => � [2] => �劇 [3] => � [4] => �的 [5] => � [6] => �進 )


History

 [2015-04-12 22:34 UTC] pffycloud at gmail dot com
-Operating System: Mac OS X
+Operating System: Mac OS X 10.10 Yosemite
 [2015-04-12 22:34 UTC] pffycloud at gmail dot com
Added Mac OS X 10.10 Yosemite to 'OS' for clarity.
 [2015-04-14 20:19 UTC] yohgaki@php.net
-Package: *General Issues
+Package: mbstring related
-Operating System: Mac OS X 10.10 Yosemite
+Operating System: any
 [2015-04-14 20:19 UTC] yohgaki@php.net
It seems '\B' behaves differently than PCRE.

http://3v4l.org/fb4XN
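
For comparison, a PCRE version of the reporter's test script might look like the sketch below. It is not necessarily the script behind the link above. It assumes the /u modifier, which makes PCRE treat the subject as UTF-8 and, in PHP, also turns on Unicode character properties, so the CJK characters count as word characters for \B. The variable names are illustrative only.

```php
<?php
// PCRE counterpart of the mb_split('\B', ...) calls in the test script.
// Assumption: /u is used so that subject and pattern are handled as UTF-8.
$subjects = ['你好', '你你', '隨著劇情的推進'];

$results = [];
foreach ($subjects as $subject) {
    // \B matches between two word characters, i.e. between each pair of
    // adjacent characters here, but not at the very start or end.
    $results[$subject] = preg_split('/\B/u', $subject);
    print_r($results[$subject]);
}
```

With preg_split() each subject comes back as one array element per character, which is the result the report expects from mb_split().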
 [2016-07-31 16:30 UTC] cmb@php.net
-Status: Open
+Status: Verified
 [2016-07-31 16:30 UTC] cmb@php.net
> It seems '\B' behaves differently than PCRE.

Indeed, and I'd call that difference a bug, see
<https://3v4l.org/U3tdm>.
 [2016-08-13 10:58 UTC] cmb@php.net
Basically, the problem is that neither mb_split() nor
mb_ereg_replace() properly handles zero-length matches, so \B is
matched at the very start of the subject and again 1 position
ahead.
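
To illustrate what handling zero-length matches involves, here is a userland sketch of a split loop. It is not the mbstring code and not a proposed patch: it uses preg_match() instead of mbregex, assumes PHP 7.1+ and valid UTF-8 input, and drops empty pieces for simplicity. Both split_on_matches() and utf8_char_length() are hypothetical helper names. The point is that after a zero-length match the next search must start one whole character further on, not one byte.

```php
<?php
// Hypothetical helper: byte length of the UTF-8 character starting at
// $offset, derived from its lead byte. Assumes $s is valid UTF-8.
function utf8_char_length(string $s, int $offset): int
{
    if ($offset >= strlen($s)) {
        return 1; // past the end: any positive step terminates the loop
    }
    $byte = ord($s[$offset]);
    if ($byte < 0x80) {
        return 1;
    }
    if ($byte >= 0xF0) {
        return 4;
    }
    return $byte >= 0xE0 ? 3 : 2;
}

// Hypothetical helper: split $subject at every match of the PCRE pattern
// $pattern (expected to carry the /u modifier). Empty pieces are dropped.
function split_on_matches(string $pattern, string $subject): array
{
    $pieces = [];
    $pieceStart = 0; // byte offset where the current piece begins
    $searchFrom = 0; // byte offset where the next search begins
    $length = strlen($subject);

    while ($searchFrom <= $length
        && preg_match($pattern, $subject, $m, PREG_OFFSET_CAPTURE, $searchFrom) === 1) {
        [$text, $pos] = $m[0];
        $matchLength = strlen($text);

        if ($pos > $pieceStart) {
            $pieces[] = substr($subject, $pieceStart, $pos - $pieceStart);
        }
        $pieceStart = $pos + $matchLength;

        // After a zero-length match, step over one whole character so the
        // same position is not matched again and no character is cut in two.
        $searchFrom = $matchLength > 0
            ? $pieceStart
            : $pos + utf8_char_length($subject, $pos);
    }

    if ($pieceStart < $length) {
        $pieces[] = substr($subject, $pieceStart);
    }
    return $pieces;
}

print_r(split_on_matches('/\B/u', '你你'));
```

Stepping by a single byte instead would restart the search inside a multibyte character, which would be consistent with the broken elements in the report's third example.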
 