Bug #67487 PREG_SPLIT_OFFSET_CAPTURE and UTF-8
Submitted: 2014-06-20 10:00 UTC
Modified: 2014-06-20 13:28 UTC
From: test at test dot com
Assigned:
Status: Not a bug
Package: PCRE related
PHP Version: 5.5.13
OS: windows
Private report: No
CVE-ID: None
 [2014-06-20 10:00 UTC] test at test dot com
Description:
------------
Quoting the preg_split() documentation:

PREG_SPLIT_OFFSET_CAPTURE

     "[...] the return value [is] an array where every element is an array consisting of the matched string at offset 0 and its string offset into subject at offset 1."

The "string offset" is wrong when the subject string is encoded in UTF-8: it appears to count bytes rather than characters.
Maybe the function uses strlen() internally instead of mb_strlen()?

Note that I use the "u" modifier in the regex.

Test script:
---------------
<?php
header("Content-Type: text/plain; charset=utf-8");
var_dump(preg_split('# #u', 'à é ù', 0, PREG_SPLIT_OFFSET_CAPTURE));
?>

Actual result:
--------------
array(3) {
  [0]=>
  array(2) {
    [0]=>
    string(2) "à"
    [1]=>
    int(0)
  }
  [1]=>
  array(2) {
    [0]=>
    string(2) "é"
    [1]=>
    int(3)
  }
  [2]=>
  array(2) {
    [0]=>
    string(2) "ù"
    [1]=>
    int(6)
  }
}

History

 [2014-06-20 13:10 UTC] johannes@php.net
-Status: Open +Status: Not a bug
 [2014-06-20 13:10 UTC] johannes@php.net
This is consistent with PHP strings. A PHP string is an array of bytes, not Unicode characters, so the offsets returned by preg_split() are byte offsets, even with the "u" modifier.
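If character offsets are needed, the byte offsets can be converted after the fact. The following is only a sketch, not part of PHP or of this report: the helper name split_with_char_offsets() is hypothetical, and it assumes the subject is valid UTF-8 and that the mbstring extension is loaded. It cuts the subject at each byte offset with substr() and counts the characters in that prefix with mb_strlen().

```php
<?php
// Hypothetical helper (not a PHP built-in): runs preg_split() with
// PREG_SPLIT_OFFSET_CAPTURE and rewrites each byte offset as a character offset.
// Assumes $subject is valid UTF-8 and the mbstring extension is available.
function split_with_char_offsets($pattern, $subject)
{
    $parts = preg_split($pattern, $subject, -1, PREG_SPLIT_OFFSET_CAPTURE);
    if ($parts === false) {
        return array(); // preg_split() failed (e.g. invalid UTF-8 with the "u" modifier)
    }
    foreach ($parts as $i => $part) {
        // substr() works on bytes, so this is the prefix that precedes the piece;
        // mb_strlen() then counts how many characters that prefix contains.
        $parts[$i][1] = mb_strlen(substr($subject, 0, $part[1]), 'UTF-8');
    }
    return $parts;
}

// "à é ù" written as explicit UTF-8 bytes so the example does not depend on file encoding.
$subject = "\xC3\xA0 \xC3\xA9 \xC3\xB9";
$result  = split_with_char_offsets('# #u', $subject);
var_dump($result);
```

For the test script above, the byte offsets 0, 3 and 6 should come out as character offsets 0, 2 and 4, because each accented letter occupies two bytes in UTF-8. Recomputing the prefix length for every piece is quadratic in the worst case, which is fine for short strings but worth keeping in mind for long subjects.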
 [2014-06-20 13:28 UTC] test at test dot com
-: alexandre at abrioux dot fr +: test at test dot com
 [2014-06-20 13:28 UTC] test at test dot com
.
 
Last updated: Tue May 07 11:01:31 2024 UTC