Bug #44418 Strange behaviour of preg_split with russian utf-8 strings
Submitted: 2008-03-12 16:00 UTC Modified: 2008-03-12 19:39 UTC
From: yarodin at gmail dot com Assigned:
Status: Not a bug Package: PCRE related
PHP Version: 5.2.5 OS: Windows XP PRO/5.1.2600
Private report: No CVE-ID: None
 [2008-03-12 16:00 UTC] yarodin at gmail dot com
Description:
------------
$split = preg_split('#(\s)#', $value, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE );
This call splits a sentence into words incorrectly when the sentence is Russian UTF-8 text and begins with the Russian letter 'Р' (hex D0h A0h). For example, the Russian string "Расширенные поля пользователей" is split by PHP 5.2.5 into 7 (!) pieces, whereas PHP 4 splits it correctly into 5. I think the problem is the Russian letter 'Р', which gets split off as a separate word.


Reproduce code:
---------------
<?php
$value="Расширенные поля пользователей";
header('Content-type: text/html; charset=utf-8');
print_r($value."<BR><BR><BR>");
$split = preg_split('#(\s)#', $value, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE );
print_r($split);
?>

Expected result:
----------------
Array ( [0] => Расширенные [1] => [2] => поля [3] => [4] => пользователей )

Actual result:
--------------
Array ( [0] => Р [1] => [2] => асширенные [3] => [4] => поля [5] => [6] => пользователей )

 [2008-03-12 19:39 UTC] nlopess@php.net
If the input is UTF-8, you need to use the 'u' modifier (e.g. '#(\s)#u').
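
For illustration, here is a minimal sketch of the reproduce code with the 'u' modifier applied. It assumes the script file itself is saved as UTF-8. The likely cause of the original behaviour, stated here as an assumption rather than something confirmed in this report, is that without 'u' the subject is matched byte by byte, so the second byte of 'Р' (A0h) can be taken for whitespace in some locales.

```php
<?php
// UTF-8 subject; the first letter 'Р' is encoded as the bytes D0h A0h.
$value = "Расширенные поля пользователей";

// The trailing 'u' makes PCRE treat both pattern and subject as UTF-8,
// so \s is tested against whole characters instead of single bytes.
$split = preg_split(
    '#(\s)#u',
    $value,
    -1,
    PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE
);

// Three words plus the two captured space delimiters.
print_r($split);
```

With the modifier in place, the result should contain the three words and the two captured spaces, matching the expected result above.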
 