Bug #37794 preg_split doesn't work as it should be with \W on utf-8 string
Submitted: 2006-06-13 11:53 UTC Modified: 2006-06-14 08:44 UTC
From: jdespatis at yahoo dot fr Assigned:
Status: Not a bug Package: PCRE related
PHP Version: 5.1.4 OS: Linux 2.6.15 Debian Testing
Private report: No CVE-ID: None
 [2006-06-13 11:53 UTC] jdespatis at yahoo dot fr
Description:
------------
preg_split("/\W/u", $utf8_string) splits a UTF-8 string in the middle of its words, between individual letters.

Reproduce code:
---------------
print_r(preg_split("/(\W)/u", "этот", -1, PREG_SPLIT_DELIM_CAPTURE));

(Note: the test string is UTF-8, so if it shows up as HTML entities you need to convert it back to UTF-8 first. It is a Russian word that reads "etot"; its first letter looks like a reversed epsilon.)

For now, I have made my code work by using \P{L} instead of \W.
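
A minimal sketch of that workaround, assuming the same one-word test string as in the reproduce code. The variable name $parts is only illustrative. \P{L} matches any character that is not a Unicode letter, so a word made entirely of Cyrillic letters contains no delimiter and should come back in one piece:

```php
<?php
// Workaround sketch: split on "not a Unicode letter" (\P{L}) instead of \W.
// The /u modifier makes PCRE treat the pattern and the subject as UTF-8.
$parts = preg_split('/(\P{L})/u', "этот", -1, PREG_SPLIT_DELIM_CAPTURE);

// The word contains only letters, so there is nothing to split on.
print_r($parts);
```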

Expected result:
----------------
Array
(
    [0] => этот
)

Actual result:
--------------
Array
(
    [0] =>
    [1] => э
    [2] =>
    [3] => т
    [4] =>
    [5] => о
    [6] =>
    [7] => т
    [8] =>
)

 [2006-06-13 18:35 UTC] nlopess@php.net
/\W/ means match any non-whitespace. you probably want to use \w (lower case)
 [2006-06-13 21:08 UTC] nlopess@php.net
Sorry, my last comment is incorrect. In UTF-8 mode you should use the Unicode property escapes (\p{..}) instead of escapes that are not UTF-8 aware, such as \W.
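
As a hedged illustration of the property escapes mentioned above (the sample strings and variable names are made up for this sketch, not taken from the report): \p{L} matches a Unicode letter, \P{L} is its negation, and \p{Nd} matches a decimal digit.

```php
<?php
// Split a UTF-8 sentence into words: any run of non-letters is a delimiter.
$words = preg_split('/\P{L}+/u', "этот тест, да", -1, PREG_SPLIT_NO_EMPTY);
print_r($words);

// Test whether a string consists only of decimal digits, via \p{Nd}.
$allDigits = preg_match('/^\p{Nd}+$/u', "2006") === 1;
var_dump($allDigits);
```

Because these escapes are defined in terms of Unicode properties, their behaviour under the /u modifier does not depend on how \w, \W or \d are treated.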
 [2006-06-14 08:44 UTC] jdespatis at yahoo dot fr
OK.
However, I have read the documentation again:
http://fr.php.net/manual/en/reference.pcre.pattern.syntax.php

I don't see it stated explicitly that \w should not be used in UTF-8 mode.
I can only see: "Since PHP 4.4.0 and 5.1.0, three additional escape sequences to match generic character types are available when UTF-8 mode is selected."

So a reader understands this as: \w works, AND in UTF-8 mode \p{} is available as well.

Would it be possible to update the documentation? (For example, I am now unsure about \d: does it work in UTF-8 mode? I don't know.)

One more thing: I have found that ucwords() and ucfirst() are not UTF-8 aware. I think the documentation should be updated to say so.
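
For the capitalisation case, one possible alternative is sketched below. It assumes the mbstring extension is loaded, which the report does not mention, and the variable names are illustrative:

```php
<?php
// Sketch only: requires the mbstring extension.
$word = "этот";

// MB_CASE_TITLE upper-cases the first letter of each word, using the
// encoding passed as the third argument rather than byte-wise rules.
$title = mb_convert_case($word, MB_CASE_TITLE, 'UTF-8');
echo $title, "\n";
```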

Thanks
 
PHP Copyright © 2001-2025 The PHP Group
All rights reserved.
Last updated: Sun Oct 26 22:00:01 2025 UTC