PHP :: Bug #67487 :: PREG_SPLIT_OFFSET

Bug #67487	PREG_SPLIT_OFFSET_CAPTURE and UTF-8
Submitted:	2014-06-20 10:00 UTC	Modified:	2014-06-20 13:28 UTC
From:	test at test dot com	Assigned:
Status:	Not a bug	Package:	PCRE related
PHP Version:	5.5.13	OS:	windows
Private report:	No	CVE-ID:	None

[2014-06-20 10:00 UTC] test at test dot com

Description:
------------
I quote the documentation:

PREG_SPLIT_OFFSET_CAPTURE

     "[...] the return value [is] an array where every element is an array consisting of the matched string at offset 0 and its string offset into subject at offset 1."

The "string offset" looks wrong when the subject string is encoded in UTF-8: it appears to count bytes rather than characters.
Maybe the function is using strlen() internally instead of mb_strlen()?

Note that I use the "u" modifier in the regex.

Test script:
---------------
<?php
header("Content-Type: text/plain; charset=utf-8");
var_dump(preg_split('# #u', 'à é ù', 0, PREG_SPLIT_OFFSET_CAPTURE));
?>

Actual result:
--------------
array(3) {
  [0]=>
  array(2) {
    [0]=>
    string(2) "à"
    [1]=>
    int(0)
  }
  [1]=>
  array(2) {
    [0]=>
    string(2) "é"
    [1]=>
    int(3)
  }
  [2]=>
  array(2) {
    [0]=>
    string(2) "ù"
    [1]=>
    int(6)
  }
}

History

[2014-06-20 13:10 UTC] johannes@php.net

-Status: Open
+Status: Not a bug

[2014-06-20 13:10 UTC] johannes@php.net

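This is consistent with PHP strings. A PHP string is an array of bytes, not Unicode characters, so the offsets returned with PREG_SPLIT_OFFSET_CAPTURE are byte offsets, even when the "u" modifier is used.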
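
For readers who need character positions rather than byte positions, one possible workaround is to convert each offset afterwards. The following is only a sketch: byteOffsetToCharOffset() is a hypothetical helper, not a PHP built-in, and it assumes the mbstring extension is loaded, the subject is valid UTF-8, and the script file itself is saved as UTF-8.

```php
<?php
// Hypothetical helper (not part of PHP): counts the UTF-8 characters that
// precede a given byte offset. Assumes $byteOffset falls on a character
// boundary, which holds for offsets returned by preg_split() with "u".
function byteOffsetToCharOffset($subject, $byteOffset)
{
    // substr() works on bytes; mb_strlen() counts characters in that prefix.
    return mb_strlen(substr($subject, 0, $byteOffset), 'UTF-8');
}

$subject = 'à é ù';
$parts = preg_split('# #u', $subject, 0, PREG_SPLIT_OFFSET_CAPTURE);

foreach ($parts as list($text, $byteOffset)) {
    printf(
        "%s: byte offset %d, character offset %d\n",
        $text,
        $byteOffset,
        byteOffsetToCharOffset($subject, $byteOffset)
    );
}
// Prints:
// à: byte offset 0, character offset 0
// é: byte offset 3, character offset 2
// ù: byte offset 6, character offset 4
```

The byte offsets 0, 3 and 6 match the reported result because each accented letter takes two bytes in UTF-8 and each space takes one. Byte offsets are still what substr() and the other non-mb string functions expect, so the conversion is only needed when positions are shown to users or passed to mb_* functions.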

[2014-06-20 13:28 UTC] test at test dot com

-: alexandre at abrioux dot fr +: test at test dot com


