Bug #53823 preg_replace: * qualifier on unicode replace garbles the string
Submitted: 2011-01-23 18:00 UTC Modified: 2015-06-23 15:17 UTC
Votes:2
Avg. Score:3.0 ± 0.0
Reproduced:1 of 1 (100.0%)
Same Version:0 (0.0%)
Same OS:1 (100.0%)
From: keith at chaos-realm dot net Assigned: cmb (profile)
Status: Closed Package: PCRE related
PHP Version: 5.6.9 OS: *
Private report: No CVE-ID: None
 [2011-01-23 18:00 UTC] keith at chaos-realm dot net
Description:
------------
When the following test script is used to strip everything except Unicode letters, the string becomes garbled as soon as the * quantifier is added. The only character that survives intact is ú.

Also, if you add \pN to the exceptions it additionally preserves the ó.

Verified on 5.2, 5.3 and 5.3-SNAP.


Test script:
---------------
echo preg_replace('/[^\pL\pM]*/iu', '', 'áéíóú');
or
echo preg_replace('/[^\pL\pM\pN]*/iu', '', 'áéíóú');

Expected result:
----------------
áéíóú

Actual result:
--------------
����ú
or 
���óú (if \pN is added to the exceptions).

Patches

bug53823.phpt (last revision 2012-02-25 09:53 UTC by robertbasic dot com at gmail dot com)

History

 [2011-01-23 18:04 UTC] keith at chaos-realm dot net
-Summary: preg_replace: * qualifier on unicode replace garbacles the string +Summary: preg_replace: * qualifier on unicode replace garbles the string
 [2011-01-23 18:04 UTC] keith at chaos-realm dot net
.
 [2011-01-23 18:09 UTC] tino dot didriksen at gmail dot com
A workaround is to use + instead of *.

These work as expected:
echo preg_replace('/[^\pL\pM]*/iu', '', 'áéíóú');
echo preg_replace('/[^\pL\pM\pN]*/iu', '', 'áéíóú');
 [2011-01-23 18:10 UTC] tino dot didriksen at gmail dot com
...and then I forget to change the *. Let's try that again...

These work as expected:
echo preg_replace('/[^\pL\pM]+/iu', '', 'áéíóú');
echo preg_replace('/[^\pL\pM\pN]+/iu', '', 'áéíóú');
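
A quick way to check the workaround is to compare the result with the input and confirm that it is still valid UTF-8. The sketch below uses only stock PCRE functions; the variable names and the mixed-input sample are illustrative and not taken from the report. Because the replacement is an empty string, skipping the empty matches that * would allow should not change the intended output.

```php
<?php
// Illustrative check of the + workaround (names and sample data are assumptions).
$subject = 'áéíóú';
$clean   = preg_replace('/[^\pL\pM]+/iu', '', $subject);

// With the u modifier, preg_match() returns false when the subject is not valid UTF-8.
$isValidUtf8 = preg_match('//u', $clean) === 1;

echo $clean === $subject ? "unchanged\n" : "changed\n";   // unchanged
echo $isValidUtf8 ? "valid UTF-8\n" : "invalid UTF-8\n";  // valid UTF-8

// Non-letters are still stripped as intended.
$mixed = preg_replace('/[^\pL\pM]+/iu', '', 'áé1 íóú!');
echo $mixed, "\n";                                        // áéíóú
```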
 [2011-01-23 22:51 UTC] felipe@php.net
-Package: Unicode Engine related +Package: PCRE related
 [2011-01-26 08:02 UTC] aharvey@php.net
-Status: Open +Status: Verified
 [2011-01-26 08:02 UTC] aharvey@php.net
Verified on 5.3 and trunk.
 [2012-02-24 23:33 UTC] robertbasic dot com at gmail dot com
I tried my best on this one. Tested against the trunk:
svn info | grep Revision
Revision: 323476

I created a test file for this, will attach.

I ran the following with gdb:

$ gdb sapi/cgi/php-cgi

and then set a breakpoint

(gdb) break php_pcre.c:1318

finally ran the test script like:

(gdb) run run-tests.php ext/pcre/tests/bug53823.phpt

I copied and pasted some gdb output to https://gist.github.com/1904467, though it may be incorrect as I'm fairly new to all this. Lines 12 and 22 in that gist caught my attention.

Also, I think the same issue exists for preg_filter, too.
 [2012-02-25 09:54 UTC] robertbasic dot com at gmail dot com
Updated the test case showing that preg_filter and preg_replace_callback are affected, too.
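
For readers who want to try this without the attachment, here is a minimal sketch that runs the same pattern through all three functions. It is a hypothetical check written for this thread, not the contents of bug53823.phpt. On a build that contains the fix, every line should report ok.

```php
<?php
// Hypothetical regression check; this is NOT the attached bug53823.phpt.
$pattern = '/[^\pL\pM]*/iu';
$subject = 'áéíóú';

$results = [
    'preg_replace'          => preg_replace($pattern, '', $subject),
    'preg_filter'           => preg_filter($pattern, '', $subject),
    'preg_replace_callback' => preg_replace_callback(
        $pattern,
        function (array $match): string {
            return ''; // same effect as the empty replacement string
        },
        $subject
    ),
];

foreach ($results as $function => $result) {
    // bin2hex() makes broken byte sequences visible even when the terminal hides them.
    printf(
        "%-22s %-8s %s\n",
        $function,
        $result === $subject ? 'ok' : 'GARBLED',
        bin2hex((string) $result)
    );
}
```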
 [2014-12-16 11:35 UTC] nhahtdh at gmail dot com
This should be a duplicate of https://bugs.php.net/bug.php?id=66121, since the underlying cause is the same.

After matching an empty string at the beginning (index 0) and replacing it with an empty string, the function tries to match at index 0 again, this time passing a flag that requires a non-empty match, which obviously fails. The function then advances the offset by 1 data unit (1 byte in this case) and hilarity ensues.

The correct behavior is that when the `u` modifier is used, the function should always advance by one code point (1 UTF-8 character).
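
The effect can be reproduced with a small model of that loop. This is a simplified sketch, not the code in php_pcre.c: the function names readChar and simulateReplace are made up for illustration, the input is assumed to be valid UTF-8, and the matcher is assumed to read a stray continuation byte as the code point with the same value (U+0080 to U+00BF).

```php
<?php
// Simplified, hypothetical model of the replace loop; not the php_pcre.c code.

// Read one "character" at a byte offset the way an unchecked UTF-8 matcher might.
// Returns [character as a UTF-8 string, number of bytes consumed].
function readChar(string $s, int $pos): array
{
    $byte = ord($s[$pos]);
    if ($byte < 0x80) {
        return [$s[$pos], 1];            // ASCII
    }
    if ($byte < 0xC0) {
        // Stray continuation byte: assumed to be read as U+0080..U+00BF,
        // whose UTF-8 encoding is 0xC2 followed by the byte itself.
        return ["\xC2" . $s[$pos], 1];
    }
    $len = $byte >= 0xF0 ? 4 : ($byte >= 0xE0 ? 3 : 2);
    return [substr($s, $pos, $len), $len];
}

// Replace every match of $class* with '' and return the result.
function simulateReplace(string $subject, string $class, bool $advanceByCharacter): string
{
    $result = '';
    $pos    = 0;
    $len    = strlen($subject);
    while ($pos < $len) {
        // Greedy match of $class* starting at $pos.
        $end = $pos;
        while ($end < $len) {
            [$char, $width] = readChar($subject, $end);
            if (preg_match('/^' . $class . '$/u', $char) !== 1) {
                break;
            }
            $end += $width;
        }
        if ($end > $pos) {
            $pos = $end;                 // non-empty match, replaced by ''
            continue;
        }
        // Empty match, and the non-empty retry fails too: copy what was skipped.
        $step    = $advanceByCharacter ? readChar($subject, $pos)[1] : 1;
        $result .= substr($subject, $pos, $step);
        $pos    += $step;
    }
    return $result;
}

echo bin2hex(simulateReplace('áéíóú', '[^\pL\pM]', false)), "\n";    // c3c3c3c3c3ba
echo bin2hex(simulateReplace('áéíóú', '[^\pL\pM\pN]', false)), "\n"; // c3c3c3c3b3c3ba
echo simulateReplace('áéíóú', '[^\pL\pM]', true), "\n";              // áéíóú
```

In this model the byte-wise step lands on the second byte of each character. The bytes 0xA1, 0xA9 and 0xAD read as ¡, © and a soft hyphen, so they are stripped and a lone 0xC3 is left behind. 0xB3 reads as ³, a number, which would explain why adding \pN preserves ó, and 0xBA reads as º, a letter, which would explain why ú always survives.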
 [2015-06-05 01:15 UTC] cmb@php.net
-Status: Verified +Status: Analyzed -Operating System: Linux +Operating System: * -PHP Version: 5.3SVN-2011-01-23 (snap) +PHP Version: 5.6.9 -Assigned To: +Assigned To: cmb
 [2015-06-05 01:15 UTC] cmb@php.net
Confirmed: <http://3v4l.org/ueAJv>.

> The correct behavior is that when the `u` modifier is used, the
> function should always advance by one code point (1 UTF-8 character).

Exactly. Thanks for analyzing the issue. :)
 [2015-06-05 12:03 UTC] cmb@php.net
Related to bug #27103.
 [2015-06-05 13:13 UTC] cmb@php.net
-Assigned To: cmb +Assigned To:
 [2015-06-05 13:13 UTC] cmb@php.net
I've added a PR that also solves #66121.
 [2015-06-23 15:17 UTC] cmb@php.net
-Assigned To: +Assigned To: cmb
 [2015-06-23 17:44 UTC] cmb@php.net
Automatic comment on behalf of cmbecker69@gmx.de
Revision: http://git.php.net/?p=php-src.git;a=commit;h=23e25f3319db021298310fb97cf537bcef4095ad
Log: Fixed Bug #53823 (preg_replace: * qualifier on unicode replace garbles the string)
 [2015-06-23 17:44 UTC] cmb@php.net
-Status: Analyzed +Status: Closed
 [2015-07-07 23:37 UTC] ab@php.net
Automatic comment on behalf of cmbecker69@gmx.de
Revision: http://git.php.net/?p=php-src.git;a=commit;h=23e25f3319db021298310fb97cf537bcef4095ad
Log: Fixed Bug #53823 (preg_replace: * qualifier on unicode replace garbles the string)
 