Bug #37391 PREG_OFFSET_CAPTURE not UTF-8 aware when using u modifier
Submitted: 2006-05-09 22:57 UTC Modified: 2006-05-10 07:03 UTC
From: mike at silverorange dot com Assigned:
Status: Not a bug Package: PCRE related
PHP Version: 5.1.4 OS: Linux
Private report: No CVE-ID: None
 [2006-05-09 22:57 UTC] mike at silverorange dot com
Description:
------------
When using preg_match_all() with the PREG_OFFSET_CAPTURE flag, the returned match offsets are in octets rather than characters.

PCRE is compiled with --enable-utf8 and I am using the u modifier in my regular expression.


Reproduce code:
---------------
<?php
$matches = array();
$reg_exp = "/B/u";
// UTF-8 encoding of "A", the euro sign (three bytes: \xe2\x82\xac), "B", "C"
$string = "A\xe2\x82\xacBC";
preg_match_all($reg_exp, $string, $matches, PREG_OFFSET_CAPTURE);
print_r($matches);
?>

Expected result:
----------------
Array
(
    [0] => Array
        (
            [0] => Array
                (
                    [0] => B
                    [1] => 2
                )
        )
)

Actual result:
--------------
Array
(
    [0] => Array
        (
            [0] => Array
                (
                    [0] => B
                    [1] => 4
                )
        )
)
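
The gap between the two results can be seen by looking at the test string both ways. This is only a minimal sketch, assuming the same string as in the reproduce code; the variable name $chars is mine, and splitting with an empty /u pattern is just one way to get a per-character view.

```php
<?php
$string = "A\xe2\x82\xacBC";

// Byte view: the euro sign occupies three bytes, so B sits at byte 4.
echo strlen($string), "\n";        // 6
echo strpos($string, 'B'), "\n";   // 4

// Character view: split the string into UTF-8 characters with the u modifier.
$chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
echo count($chars), "\n";              // 4
echo array_search('B', $chars), "\n";  // 2
?>
```

The offset of 4 reported by PREG_OFFSET_CAPTURE matches the byte view, while the expected offset of 2 matches the character view.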

 [2006-05-10 07:03 UTC] derick@php.net
Thank you for taking the time to write to us, but this is not
a bug. Please double-check the documentation available at
http://www.php.net/manual/ and the instructions on how to report
a bug at http://bugs.php.net/how-to-report.php
 [2012-03-19 00:00 UTC] harald dot lapp at gmail dot com
I am not sure where the manual mentions that PREG_OFFSET_CAPTURE is not UTF-8
aware. And even if it does, this is still very, very, very annoying. Is there any
chance that this behaviour could be changed?
 [2018-10-01 23:30 UTC] arnold at jasny dot net
There is no obvious way to get the character offset with preg_match_all().

Note that this bug is the pun of UTF-8 failure in PHP. Just for that, it should be solved. https://stackoverflow.com/a/1725329/1160754
 [2019-05-23 18:10 UTC] phpbugreport888 at allanid dot com
I fully agree with arnold at jasny dot net.

Why can't we have some way to work with UTF-8 strings, like another flag or some usable workaround? I just can't imagine that this wouldn't be doable. I would like an answer as to why, not just a reference to a manual that doesn't even mention the issue.
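
One possible workaround, for what it's worth, is to convert each reported byte offset into a character offset after matching. The sketch below is not something PHP provides: byte_offset_to_char_offset() is a hypothetical helper name, and it assumes the subject is valid UTF-8. It uses mb_strlen() when the mbstring extension is loaded and otherwise counts characters with a /u pattern.

```php
<?php
// Hypothetical helper (not part of PHP): converts a byte offset reported by
// PREG_OFFSET_CAPTURE into a character offset. Assumes $subject is valid UTF-8.
function byte_offset_to_char_offset($subject, $byteOffset)
{
    // Everything before the match, still measured in bytes.
    $prefix = substr($subject, 0, $byteOffset);

    if (function_exists('mb_strlen')) {
        // Number of UTF-8 characters in the prefix.
        return mb_strlen($prefix, 'UTF-8');
    }

    // Fallback without mbstring: count characters by matching each one.
    return (int) preg_match_all('/./su', $prefix, $unused);
}

$string = "A\xe2\x82\xacBC";
$matches = array();
preg_match_all('/B/u', $string, $matches, PREG_OFFSET_CAPTURE);

// Rewrite every byte offset in the full-match group as a character offset.
foreach ($matches[0] as $i => $match) {
    $matches[0][$i][1] = byte_offset_to_char_offset($string, $match[1]);
}

print_r($matches); // the offset of B is now 2 instead of 4
?>
```

This costs an extra pass over the prefix for every match, so for large subjects with many matches it may be worth walking the string once and converting the offsets incrementally.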
 [2020-10-01 12:04 UTC] thomas at landauer dot at
I opened a new bug for this; maybe we'll have more luck this time :-) https://bugs.php.net/bug.php?id=80166
 