Bug #37391 PREG_OFFSET_CAPTURE not UTF-8 aware when using u modifier
Submitted: 2006-05-09 22:57 UTC Modified: 2006-05-10 07:03 UTC
From: mike at silverorange dot com Assigned:
Status: Not a bug Package: PCRE related
PHP Version: 5.1.4 OS: Linux
Private report: No CVE-ID: None
 [2006-05-09 22:57 UTC] mike at silverorange dot com
Description:
------------
When using preg_match_all() with the PREG_OFFSET_CAPTURE flag, the returned match offsets are in octets rather than characters.

PCRE is compiled with --enable-utf8 and I am using the u modifier in my regular expression.


Reproduce code:
---------------
<?php
$matches = array();
$reg_exp = "/B/u";
// UTF-8 bytes for "A€BC"; the euro sign is the 3-byte sequence E2 82 AC
$string = "A\xe2\x82\xacBC";
preg_match_all($reg_exp, $string, $matches, PREG_OFFSET_CAPTURE);
print_r($matches);
?>

Expected result:
----------------
Array
(
    [0] => Array
        (
            [0] => Array
                (
                    [0] => B
                    [1] => 2
                )
        )
)

Actual result:
--------------
Array
(
    [0] => Array
        (
            [0] => Array
                (
                    [0] => B
                    [1] => 4
                )
        )
)
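
The gap between the expected offset 2 and the actual offset 4 comes from the euro sign taking three bytes in UTF-8. A minimal sketch of that arithmetic, using only the byte-oriented functions strlen() and strpos() (the variable names are illustrative, not from the report):

```php
<?php
// Same subject string as in the reproduce code above.
$string = "A\xe2\x82\xacBC";

// strlen() and strpos() count bytes, as PREG_OFFSET_CAPTURE does.
$euroBytes  = strlen("\xe2\x82\xac"); // size of the euro sign alone
$byteOffset = strpos($string, "B");   // number of bytes that precede "B"

echo "euro sign bytes: $euroBytes\n";   // prints "euro sign bytes: 3"
echo "byte offset of B: $byteOffset\n"; // prints "byte offset of B: 4"
```

One byte for "A" plus three for the euro sign gives the byte offset 4, while counting characters gives 2.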

 [2006-05-10 07:03 UTC] derick@php.net
Thank you for taking the time to write to us, but this is not
a bug. Please double-check the documentation available at
http://www.php.net/manual/ and the instructions on how to report
a bug at http://bugs.php.net/how-to-report.php.
 [2012-03-19 00:00 UTC] harald dot lapp at gmail dot com
I am not sure where the manual mentions that PREG_OFFSET_CAPTURE is not UTF-8
aware. And even if it did, the behaviour is still very, very, very annoying. Is
there any chance that this behaviour could be changed?
 [2018-10-01 23:30 UTC] arnold at jasny dot net
There is no obvious way to get the character offset with preg_match_all().

Note that this bug is the pun of UTF-8 failure in PHP. Just for that, it should be solved. https://stackoverflow.com/a/1725329/1160754
 [2019-05-23 18:10 UTC] phpbugreport888 at allanid dot com
I fully agree with arnold at jasny dot net.

Why can't we have some way to work with UTF-8 strings, like another flag or some usable workaround? I just can't imagine that it wouldn't be doable. I would like an answer as to why, not just a reference to the manual that doesn't even mention the issue.
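
One possible workaround, sketched here under assumptions and not an official API: convert each byte offset afterwards by counting the characters in the bytes that precede the match. The helper name byte_to_char_offsets() is hypothetical. The sketch assumes PHP 7 or later, that the mbstring extension is loaded, that the subject is valid UTF-8, and that the default PREG_PATTERN_ORDER result layout is used.

```php
<?php
// Hypothetical helper (not part of PHP): rewrites the byte offsets returned
// by PREG_OFFSET_CAPTURE into character offsets.
// Assumptions: ext/mbstring is available, $subject is valid UTF-8, and
// $matches uses the default PREG_PATTERN_ORDER layout.
function byte_to_char_offsets(string $subject, array $matches): array
{
    foreach ($matches as &$group) {
        foreach ($group as &$match) {
            // Skip entries that are not [text, offset] pairs or that carry
            // the -1 offset used for groups that did not participate.
            if (is_array($match) && $match[1] >= 0) {
                // Count the characters in the bytes preceding the match.
                $prefix   = substr($subject, 0, $match[1]);
                $match[1] = mb_strlen($prefix, 'UTF-8');
            }
        }
        unset($match);
    }
    unset($group);

    return $matches;
}

// Usage with the string from the original report.
$string = "A\xe2\x82\xacBC";
preg_match_all('/B/u', $string, $matches, PREG_OFFSET_CAPTURE);
$converted = byte_to_char_offsets($string, $matches);
print_r($converted); // the offset for "B" is now 2 instead of 4
```

This rescans the prefix for every match, so the cost grows with the number of matches times the subject length. For large inputs it may be worth walking the matches in order and counting only the bytes between consecutive offsets.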
 [2020-10-01 12:04 UTC] thomas at landauer dot at
I opened a new bug for this - maybe we have more luck this time :-) https://bugs.php.net/bug.php?id=80166
 