Bug #27103 preg_split('//u' ...) splits into octets, not UTF-8 characters.
Submitted: 2004-01-31 07:16 UTC Modified: 2004-01-31 23:15 UTC
From: Aidan Kehoe <php-manual at parhasard dot net> Assigned:
Status: Closed Package: PCRE related
PHP Version: 4CVS,5CVS OS: *
Private report: No CVE-ID: None

 [2004-01-31 07:16 UTC] Aidan Kehoe <php-manual at parhasard dot net>
Description:
------------
http://php.net/manual/en/pcre.pattern.modifiers.php states that the /u modifier "... turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8."

The PCRE documentation itself says "In order process UTF-8 strings, you must build PCRE to include UTF-8 support in the code, and, in addition, you must call pcre_compile() with the PCRE_UTF8 option flag. When you do this, both the pattern and any subject strings that are matched against it are treated as UTF-8 strings instead of just strings of bytes." This says, to me, that the /u modifier in our PCRE expressions maps to the PCRE_UTF8 option flag in the C API.

And, sure enough, preg_match_all('/./u', $string, $matches) puts an array of all the UTF-8 characters in $string into $matches[0]. 

preg_split('//u', $string) then, by this logic, should return an array containing the UTF-8 characters in $string. It doesn't--it returns instead an array of the octets in $string. 
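
Until preg_split('//u', ...) behaves this way, the preg_match_all() behaviour described above can be wrapped up as a workaround. The following is only a minimal sketch: utf8_chars() is a hypothetical helper name, not an existing PHP function, and the /s modifier is added on the assumption that newlines should be returned as characters too (without it, "." skips them).

```php
<?php
/* Hypothetical helper (not part of PHP): split a UTF-8 string into
   characters without relying on preg_split('//u', ...). */
function utf8_chars($str)
{
    /* /u makes PCRE treat pattern and subject as UTF-8; /s lets "."
       match newlines as well, so no character is silently dropped. */
    if (preg_match_all('/./us', $str, $matches) === false) {
        /* Assumption: treat a PCRE error (for example, malformed
           UTF-8 in the subject) as "no characters". */
        return array();
    }
    return $matches[0];
}

/* The Euro sign followed by " hi there", as in the report. */
$chars = utf8_chars("\xe2\x82\xac hi there");
print count($chars) . "\n";      /* 10 characters */
print strlen($chars[0]) . "\n";  /* 3 octets for U+20AC */
```

Returning an empty array on failure is a design choice made for this sketch; a caller that needs to tell an empty string apart from invalid input would have to check for the error separately.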

Reproduce code:
---------------
#!/usr/pkg/bin/php
<?php
/* The Euro sign--U+20AC--followed by " hi there", 
   in UTF-8. */
$teststr = "\xe2\x82\xac hi there";

/* Split it into individual characters, passing the /u flag
   to tell PCRE to interpret the string as UTF-8. */
$testchars = preg_split('//u', $teststr, -1, PREG_SPLIT_NO_EMPTY);

/* Get some output that should be equivalent. */
preg_match_all('/./u', $teststr, $matches);
$goodtestchars = $matches[0];

/* The arrays should be the same length. */ 
print "This should be 1: '".(count($testchars) 
        == count($goodtestchars))."'\n";

/* And the octet count of the first entry should be 
   three for both arrays. */
print 'These both should be three: '; 
print strlen($testchars[0]).', '.strlen($goodtestchars[0]).
        "\n";

?>


Expected result:
----------------
$ ./testing.php
This should be 1: '1'
These both should be three: 3, 3
$ 


Actual result:
--------------
$ ./testing.php
This should be 1: ''
These both should be three: 1, 3
$ 

 [2004-01-31 17:35 UTC] moriyoshi@php.net
This bug has been fixed in CVS.

Snapshots of the sources are packaged every three hours; this change
will be in the next snapshot. You can grab the snapshot at
http://snaps.php.net/.
 
Thank you for the report, and for helping us make PHP better.

The fix will be in PHP 5.0.

 
PHP Copyright © 2001-2025 The PHP Group
All rights reserved.
Last updated: Sat Jan 04 22:01:28 2025 UTC