Bug #79238 preg_split returns weird results
Submitted: 2020-02-07 07:50 UTC
Modified: 2020-02-07 08:00 UTC
From: vuongvankhanh89 at gmail dot com
Assigned:
Status: Not a bug
Package: PCRE related
PHP Version: 7.4.2
OS: Ubuntu 18.04
Private report: No
CVE-ID: None

 

 [2020-02-07 07:50 UTC] vuongvankhanh89 at gmail dot com
Description:
------------
Hi, today I created a pattern to split text into an array.
The regex is very simple:

preg_split('/\R| /m', '翶耆倈者耈翶耆倈傀堀X蔀耖耄')

It treats spaces and line breaks as separators and converts the string into an array of strings.

In the example above there is no space or line break, but the result I got is an array with 2 elements.

array(2) {
  [0]=>
  string(11) "翶耆倈�"
  [1]=>
  string(28) "耈翶耆倈傀堀X蔀耖耄"
}

者 has been treated as a separator, and it also turns into a weird character.

Kindly help! Thanks in advance.




Test script:
---------------
var_dump(preg_split('/\R| /m', '翶耆倈者耈翶耆倈傀堀X蔀耖耄'));

Expected result:
----------------
array(1) {
  [0]=>
  string(39) "翶耆倈者耈翶耆倈傀堀X蔀耖耄"
}

Actual result:
--------------
array(2) {
  [0]=>
  string(11) "翶耆倈�"
  [1]=>
  string(28) "耈翶耆倈傀堀X蔀耖耄"
}

 [2020-02-07 08:00 UTC] requinix@php.net
-Status: Open
+Status: Not a bug
 [2020-02-07 08:00 UTC] requinix@php.net
You must use UTF-8 mode (the /u pattern modifier) when working with UTF-8 patterns or inputs. https://3v4l.org/lI7FX
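
A minimal sketch of why this happens, assuming the script file is saved as UTF-8 and PCRE uses its default \R behaviour (as the reported output suggests). Without /u the subject is scanned byte by byte. 者 (U+8005) is encoded as the bytes E8 80 85, and in byte mode \R accepts a lone 0x85 (NEL) as a line break, so the split lands in the middle of that character. The variable names below are illustrative only.

```php
<?php
// Subject string from the report (this file must be saved as UTF-8).
$subject = '翶耆倈者耈翶耆倈傀堀X蔀耖耄';

// UTF-8 bytes of the character that was split: e8 80 85.
// The trailing 0x85 is what \R matches when PCRE works on raw bytes.
var_dump(bin2hex('者'));

// Byte mode (no /u): the report shows this returning 2 elements,
// the first ending in the incomplete sequence E8 80.
$bytes = preg_split('/\R| /m', $subject);
var_dump(count($bytes));

// UTF-8 mode (/u): subject and pattern are decoded as UTF-8, so \R only
// matches real line-break code points and the string is left whole.
$chars = preg_split('/\R| /mu', $subject);
var_dump($chars);
```

With /u the call should return a single element equal to the original string, which is the expected result from the report. Note that in UTF-8 mode preg_split returns false if the subject is not valid UTF-8, so checking the return value is worthwhile when the input encoding is not guaranteed.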
 
Last updated: Wed Jan 15 12:01:29 2025 UTC