Bug #37794 preg_split doesn't work as it should be with \W on utf-8 string
Submitted: 2006-06-13 11:53 UTC Modified: 2006-06-14 08:44 UTC
From: jdespatis at yahoo dot fr Assigned:
Status: Not a bug Package: PCRE related
PHP Version: 5.1.4 OS: Linux 2.6.15 Debian Testing
Private report: No CVE-ID: None
 [2006-06-13 11:53 UTC] jdespatis at yahoo dot fr
Description:
------------
preg_split("/\W/u", $utf8_string) treats every character of a UTF-8 word as a delimiter, so the word gets cut apart.

Reproduce code:
---------------
print_r(preg_split("/(\W)/u", "этот", -1, PREG_SPLIT_DELIM_CAPTURE));

(Note: the string above is UTF-8. If you see HTML entities instead, convert them to UTF-8 first. It is a Russian word that reads "etot", where the first letter looks like a reversed epsilon.)

For now, I have made my code work by using \P{L} instead of \W.
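
A minimal sketch of that workaround, assuming the script file itself is saved as UTF-8. The sample sentence is illustrative (it extends the word from the report with a second word), and the `+` quantifier is an addition so that a run of non-letters counts as one delimiter:

```php
<?php
// Split on runs of non-letter characters using the Unicode property
// escape \P{L} ("not a letter") instead of \W.
// The /u modifier makes PCRE treat pattern and subject as UTF-8.
$text = "этот тест"; // illustrative input, not from the original report

$parts = preg_split('/(\P{L}+)/u', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

print_r($parts);
// Array
// (
//     [0] => этот
//     [1] =>  
//     [2] => тест
// )
```

Note that \P{L} also treats digits and underscores as delimiters, which \W does not, so it is not a drop-in replacement in every case.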

Expected result:
----------------
Array
(
    [0] => этот
)

Actual result:
--------------
Array
(
    [0] =>
    [1] => э
    [2] =>
    [3] => т
    [4] =>
    [5] => о
    [6] =>
    [7] => т
    [8] =>
)

 [2006-06-13 18:35 UTC] nlopess@php.net
/\W/ means match any non-whitespace. you probably want to use \w (lower case)
 [2006-06-13 21:08 UTC] nlopess@php.net
Sorry, my last comment is incorrect. In UTF-8 mode you should use the property escapes (\p{..}) instead of the escapes that are not UTF-8 aware, such as \W.
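
As a rough illustration of that advice, here is a sketch of a word extractor built from property escapes. The function name `unicode_words` is hypothetical, the character class `[\p{L}\p{N}_]` is only an approximation of what \w is meant to cover, and the scalar type hints assume PHP 7 or later:

```php
<?php
// Hypothetical helper: collect "words" using Unicode property escapes
// rather than \w. \p{L} matches any letter, \p{N} any number; the
// underscore is added to mirror the usual definition of \w.
function unicode_words(string $text): array
{
    preg_match_all('/[\p{L}\p{N}_]+/u', $text, $matches);
    return $matches[0];
}

print_r(unicode_words("этот тест, abc_1"));
// Array
// (
//     [0] => этот
//     [1] => тест
//     [2] => abc_1
// )
```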
 [2006-06-14 08:44 UTC] jdespatis at yahoo dot fr
OK.
However, I have read the documentation again:
http://fr.php.net/manual/en/reference.pcre.pattern.syntax.php

I don't see it explicitly stated that "in UTF-8 mode, don't use \w".
I can only see: "Since PHP 4.4.0 and 5.1.0, three additional escape sequences to match generic character types are available when UTF-8 mode is selected."

So a reader understands this as: \w works, AND in UTF-8 mode \p{} is also available.

Would it be possible to update the documentation? (For example, I now have doubts about \d: does it work on UTF-8 strings? I don't know.)
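
One way to sidestep the \d question is to state the intent explicitly in the pattern. The sketch below assumes a UTF-8 source file and uses Arabic-Indic digits as an illustrative input. It makes no claim about what \d itself matches, since that can depend on the PHP and PCRE versions in use:

```php
<?php
// Be explicit about which digits are wanted instead of relying on \d.
$arabicIndic = "٣٤٥"; // U+0663 U+0664 U+0665, illustrative input
$ascii       = "345";

// \p{Nd} matches any Unicode decimal digit when the /u modifier is set.
var_dump(preg_match('/^\p{Nd}+$/u', $arabicIndic)); // int(1)
var_dump(preg_match('/^\p{Nd}+$/u', $ascii));       // int(1)

// [0-9] matches ASCII digits only, with or without /u.
var_dump(preg_match('/^[0-9]+$/', $arabicIndic));   // int(0)
var_dump(preg_match('/^[0-9]+$/', $ascii));         // int(1)
```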

One more thing: I have found that ucwords() and ucfirst() are not UTF-8 aware. I think the documentation should be updated to say so.
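
For multibyte strings, a possible substitute is sketched below. It assumes the mbstring extension is loaded and the source file is UTF-8. The helper name `mb_ucfirst_sketch` is hypothetical, and passing `null` as the length to mb_substr() assumes a reasonably recent PHP version:

```php
<?php
// Hypothetical UTF-8-aware stand-in for ucfirst(), built on mbstring.
function mb_ucfirst_sketch(string $s, string $enc = 'UTF-8'): string
{
    $first = mb_substr($s, 0, 1, $enc);    // first character, not first byte
    $rest  = mb_substr($s, 1, null, $enc); // everything after it
    return mb_strtoupper($first, $enc) . $rest;
}

echo mb_ucfirst_sketch("этот тест"), "\n";                        // Этот тест

// mb_convert_case() with MB_CASE_TITLE plays the role of ucwords().
echo mb_convert_case("этот тест", MB_CASE_TITLE, 'UTF-8'), "\n"; // Этот Тест
```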

Thanks
 