Bug #67487 PREG_SPLIT_OFFSET_CAPTURE and UTF-8
Submitted: 2014-06-20 10:00 UTC Modified: 2014-06-20 13:28 UTC
From: test at test dot com Assigned:
Status: Not a bug Package: PCRE related
PHP Version: 5.5.13 OS: windows
Private report: No CVE-ID: None

 [2014-06-20 10:00 UTC] test at test dot com
Description:
------------
I quote the documentation:

PREG_SPLIT_OFFSET_CAPTURE

     "[...] the return value [is] an array where every element is an array consisting of the matched string at offset 0 and its string offset into subject at offset 1."

The "string offset" is wrong when the subject string is encoded in UTF-8.
Maybe the function is using strlen() internally instead of mb_strlen()?

Note that I use the "u" modifier in the regex.

Test script:
---------------
<?php
header("Content-Type: text/plain; charset=utf-8");
var_dump(preg_split('# #u', 'à é ù', 0, PREG_SPLIT_OFFSET_CAPTURE));
?>

Actual result:
--------------
array(3) {
  [0]=>
  array(2) {
    [0]=>
    string(2) "à"
    [1]=>
    int(0)
  }
  [1]=>
  array(2) {
    [0]=>
    string(2) "é"
    [1]=>
    int(3)
  }
  [2]=>
  array(2) {
    [0]=>
    string(2) "ù"
    [1]=>
    int(6)
  }
}

 [2014-06-20 13:10 UTC] johannes@php.net
-Status: Open +Status: Not a bug
 [2014-06-20 13:10 UTC] johannes@php.net
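This is consistent with PHP strings. A PHP string is an array of bytes, not Unicode characters, so the offsets reported by PREG_SPLIT_OFFSET_CAPTURE are byte offsets. In the test script each accented letter occupies two bytes in UTF-8, which is why the offsets are 0, 3 and 6.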
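
If character offsets are needed, one possible workaround is to convert each byte offset after the split. The following is a minimal sketch, not an official API: the helper name preg_split_char_offsets() is hypothetical, and it assumes the mbstring extension is loaded, the subject is valid UTF-8, and the script file itself is saved as UTF-8.

```php
<?php
// Hypothetical helper (not part of PHP): runs preg_split() with
// PREG_SPLIT_OFFSET_CAPTURE and rewrites each byte offset as a
// UTF-8 character offset. Assumes mbstring is available.
function preg_split_char_offsets($pattern, $subject)
{
    $parts = preg_split($pattern, $subject, -1, PREG_SPLIT_OFFSET_CAPTURE);
    if ($parts === false) {
        return array(); // the pattern failed to execute
    }
    foreach ($parts as $i => $part) {
        // $part[1] is a byte offset. Take the bytes that precede it
        // and count how many UTF-8 characters they contain.
        $prefix = substr($subject, 0, $part[1]);
        $parts[$i][1] = mb_strlen($prefix, 'UTF-8');
    }
    return $parts;
}

$result = preg_split_char_offsets('# #u', 'à é ù');
// Byte offsets were 0, 3, 6; character offsets are 0, 2, 4.
```

The conversion re-scans the prefix for every piece, so it is quadratic in the worst case. For long subjects with many pieces, it may be preferable to count characters incrementally between consecutive offsets.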
 [2014-06-20 13:28 UTC] test at test dot com
-: alexandre at abrioux dot fr +: test at test dot com
 [2014-06-20 13:28 UTC] test at test dot com
.
 
Last updated: Thu Jul 03 13:01:33 2025 UTC