PHP :: Bug #27103 :: preg_split('//u' ...) splits into octets, not UTF-8 characters.

Bug #27103	preg_split('//u' ...) splits into octets, not UTF-8 characters.
Submitted:	2004-01-31 07:16 UTC	Modified:	2004-01-31 23:15 UTC
From:	Aidan Kehoe <php-manual at parhasard dot net>	Assigned:
Status:	Closed	Package:	PCRE related
PHP Version:	4CVS,5CVS	OS:	*
Private report:	No	CVE-ID:	None

[2004-01-31 07:16 UTC] Aidan Kehoe <php-manual at parhasard dot net>

Description:
------------
http://php.net/manual/en/pcre.pattern.modifiers.php states that the /u modifier "... turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8."

The PCRE documentation itself says "In order process UTF-8 strings, you must build PCRE to include UTF-8 support in the code, and, in addition, you must call pcre_compile() with the PCRE_UTF8 option flag. When you do this, both the pattern and any subject strings that are matched against it are treated as UTF-8 strings instead of just strings of bytes." I read this as saying that the /u modifier in PHP's PCRE patterns maps to the PCRE_UTF8 option flag passed to pcre_compile() in the C code.

And, sure enough, preg_match_all('/./u', $string, $matches) puts an array of the UTF-8 characters in $string into $matches[0] (newlines excepted, since '.' does not match them without the /s modifier).

By this logic, preg_split('//u', $string) should return an array containing the UTF-8 characters in $string. Instead, it returns an array of the octets in $string.

Reproduce code:
---------------
#!/usr/pkg/bin/php
<?php
/* The Euro sign--U+20AC--followed by " hi there", 
   in UTF-8. */
$teststr = "\xe2\x82\xac hi there";

/* Split it into individual characters, passing the /u flag
   to tell PCRE to interpret the string as UTF-8. */
$testchars = preg_split('//u', $teststr, -1, PREG_SPLIT_NO_EMPTY);

/* Get some output that should be equivalent. */
preg_match_all('/./u', $teststr, $matches);
$goodtestchars = $matches[0];

/* The arrays should be the same length. */ 
print "This should be 1: '".(count($testchars) 
        == count($goodtestchars))."'\n";

/* And the octet count of the first entry should be 
   three for both arrays. */
print 'These both should be three: '; 
print strlen($testchars[0]).', '.strlen($goodtestchars[0]).
        "\n";

 ?>


Expected result:
----------------
$ ./testing.php
This should be 1: '1'
These both should be three: 3, 3
$ 


Actual result:
--------------
$ ./testing.php
This should be 1: ''
These both should be three: 1, 3
$
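
Possible workaround:
--------------------
On builds that show the behaviour above, the preg_match_all() call from the reproduce code can stand in for preg_split('//u'). The following is only a sketch: utf8_chars() is a hypothetical helper name, not a PHP function, and the /s modifier is added on the assumption that newlines should be returned as characters too.

```php
<?php
/* Hypothetical helper: split a UTF-8 string into characters
   without relying on preg_split('//u'). */
function utf8_chars($str)
{
    /* /u makes PCRE treat the subject as UTF-8; /s lets '.'
       match newlines as well. preg_match_all() returns false
       on error, for example when $str is not valid UTF-8. */
    if (preg_match_all('/./su', $str, $matches) === false) {
        return false;
    }
    return $matches[0];
}

/* The Euro sign, a space, "hi" and a newline: five characters,
   the first of which is three octets long. */
$chars = utf8_chars("\xe2\x82\xac hi\n");
print count($chars).', '.strlen($chars[0])."\n";   /* 5, 3 */
?>
```

The helper returns false rather than an empty array for invalid input, so callers can tell "no characters" apart from "not UTF-8".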

[2004-01-31 17:35 UTC] moriyoshi@php.net

This bug has been fixed in CVS.

Snapshots of the sources are packaged every three hours; this change
will be in the next snapshot. You can grab the snapshot at
http://snaps.php.net/.
 
Thank you for the report, and for helping us make PHP better.

The fix will be in PHP 5.0.
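
To check whether a particular build includes the fix, the comparison from the report can be reduced to a single equality test. This is a sketch that assumes PCRE was built with UTF-8 support; the variable names are illustrative only.

```php
<?php
/* The Euro sign (U+20AC) followed by " hi there", in UTF-8. */
$teststr = "\xe2\x82\xac hi there";

/* Split on the empty pattern, dropping the empty leading and
   trailing pieces. */
$split = preg_split('//u', $teststr, -1, PREG_SPLIT_NO_EMPTY);

/* Reference result: one match per UTF-8 character. */
preg_match_all('/./u', $teststr, $matches);

/* On a build with the fix, both arrays hold the same ten
   characters, so this should print bool(true). */
var_dump($split === $matches[0]);
?>
```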

[2015-06-05 12:03 UTC] cmb@php.net

Related To: Bug #53823



Copyright © 2001-2026 The PHP Group All rights reserved.	Last updated: Tue Jul 28 12:00:02 2026 UTC