Bug #44418 Strange behaviour of preg_split with russian utf-8 strings
Submitted: 2008-03-12 16:00 UTC
Modified: 2008-03-12 19:39 UTC
From: yarodin at gmail dot com
Assigned:
Status: Not a bug
Package: PCRE related
PHP Version: 5.2.5
OS: Windows XP PRO/5.1.2600
Private report: No
CVE-ID: None

 [2008-03-12 16:00 UTC] yarodin at gmail dot com
Description:
------------
$split = preg_split('#(\s)#', $value, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE );
splits a sentence into words incorrectly when the sentence is Russian UTF-8 text and begins with the Russian letter 'Р' (hex D0 A0). For example, PHP 5.2.5 splits the Russian phrase "Расширенные поля пользователей" into 7(!) elements, while PHP 4 splits it correctly into 5. I think the problem is the Russian letter 'Р', which is split off as a separate word.


Reproduce code:
---------------
<?php
$value="Расширенные поля пользователей";
header('Content-type: text/html; charset=utf-8');
print_r($value."<BR><BR><BR>");
$split = preg_split('#(\s)#', $value, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE );
print_r($split);
?>

Expected result:
----------------
Array ( [0] => Расширенные [1] => [2] => поля [3] => [4] => пользователей )

Actual result:
--------------
Array ( [0] => Р [1] => [2] => асширенные [3] => [4] => поля [5] => [6] => пользователей )

 [2008-03-12 19:39 UTC] nlopess@php.net
If the input is UTF-8, you need to use the 'u' modifier (e.g. '#(\s)#u').
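
A minimal sketch of the suggested fix, reusing the string and flags from the report. It assumes the script file itself is saved as UTF-8; the variable name $firstLetterHex is only illustrative. The thread does not say why the split happens without 'u'. A plausible reading, offered here as an assumption, is that in byte mode the second byte of 'Р' (A0) can be classified as whitespace under some locales.

```php
<?php
// Assumes this source file is saved as UTF-8.
$value = "Расширенные поля пользователей";
$flags = PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE;

// The first letter 'Р' occupies two bytes in UTF-8: D0 A0.
$firstLetterHex = bin2hex(substr($value, 0, 2));

// With the 'u' modifier, pattern and subject are treated as UTF-8,
// so \s is matched against whole characters, not individual bytes.
$split = preg_split('#(\s)#u', $value, -1, $flags);

print_r($split);
```

With the 'u' modifier the result has the 5 elements the reporter expected: the three words plus the two captured space delimiters.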
 