PHP :: Bug #44418 :: Strange behaviour of preg_split with russian utf-8 strings

Bug #44418	Strange behaviour of preg_split with russian utf-8 strings
Submitted:	2008-03-12 16:00 UTC	Modified:	2008-03-12 19:39 UTC
From:	yarodin at gmail dot com	Assigned:
Status:	Not a bug	Package:	PCRE related
PHP Version:	5.2.5	OS:	Windows XP PRO/5.1.2600
Private report:	No	CVE-ID:	None


[2008-03-12 16:00 UTC] yarodin at gmail dot com

Description:
------------
$split = preg_split('#(\s)#', $value, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE );
splits sentences into words incorrectly when the sentence is Russian UTF-8 text and begins with the Russian letter 'Р' (hex D0h A0h). For example, the Russian sentence "Расширенные поля пользователей" is split by PHP 5.2.5 into 7(!) elements, while PHP 4 splits it correctly into 5. I think the problem is the Russian letter 'Р', which is split off as a separate word.


Reproduce code:
---------------
<?php
// Save this file as UTF-8; the sentence starts with 'Р' (bytes D0 A0).
$value = "Расширенные поля пользователей";
header('Content-Type: text/html; charset=utf-8');
echo $value . "<br><br><br>";
// No 'u' modifier here, so the subject is matched byte by byte.
$split = preg_split('#(\s)#', $value, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
print_r($split);
?>

Expected result:
----------------
Array ( [0] => Расширенные [1] => [2] => поля [3] => [4] => пользователей )

Actual result:
--------------
Array ( [0] => Р [1] => [2] => асширенные [3] => [4] => поля [5] => [6] => пользователей )


[2008-03-12 19:39 UTC] nlopess@php.net

If the input is UTF-8, you need to use the 'u' modifier (e.g. '#(\s)#u').
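A minimal sketch of the suggested fix, assuming the script file itself is saved as UTF-8. A likely explanation, not stated in the report, is that without 'u' PCRE matches byte by byte, and the second byte of 'Р' (0xA0) is a non-breaking space in several single-byte code pages, so a locale-dependent \s can match it. The variable names are taken from the reproduce code above; $flags is introduced here only for readability.

```php
<?php
// Assumption: this source file is saved as UTF-8.
$value = "Расширенные поля пользователей";

// 'Р' (U+0420) is encoded as the two bytes D0 A0 in UTF-8.
echo bin2hex("Р"), "\n"; // prints d0a0

$flags = PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE;

// The 'u' modifier makes PCRE treat pattern and subject as UTF-8,
// so \s is tested against whole characters rather than single bytes.
$split = preg_split('#(\s)#u', $value, -1, $flags);

// Three words plus the two captured space delimiters.
print_r($split);
```

With the 'u' modifier the result has the 5 elements listed under "Expected result". Note that preg_split returns false (and preg_last_error() reports an error) if the subject is not valid UTF-8, so the modifier is only appropriate when the input encoding is known.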



Copyright © 2001-2026 The PHP Group All rights reserved.	Last updated: Fri Mar 27 18:00:02 2026 UTC