Bug #53823 preg_replace: * qualifier on unicode replace garbles the string
Submitted: 2011-01-23 18:00 UTC Modified: 2015-06-23 15:17 UTC
From: keith at chaos-realm dot net Assigned: cmb
Status: Closed Package: PCRE related
PHP Version: 5.6.9 OS: *
Private report: No CVE-ID:
 [2011-01-23 18:00 UTC] keith at chaos-realm dot net
Description:
------------
When the following test script is used to strip everything except Unicode letters and marks, the string becomes garbled once the * quantifier is added. The only character that survives intact is ú.

Also, if \pN is added to the exceptions, the ó is preserved as well.

Verified on 5.2, 5.3 and 5.3-SNAP.


Test script:
---------------
echo preg_replace('/[^\pL\pM]*/iu', '', 'áéíóú');
or
echo preg_replace('/[^\pL\pM\pN]*/iu', '', 'áéíóú');

Expected result:
----------------
áéíóú

Actual result:
--------------
����ú
or 
���óú (if \pN is added to the exceptions).

Patches

bug53823.phpt (last revision 2012-02-25 09:53 UTC by robertbasic dot com at gmail dot com)


History

 [2011-01-23 18:04 UTC] keith at chaos-realm dot net
-Summary: preg_replace: * qualifier on unicode replace garbacles the string +Summary: preg_replace: * qualifier on unicode replace garbles the string
 [2011-01-23 18:09 UTC] tino dot didriksen at gmail dot com
A workaround is to use + instead of *.

These work as expected:
echo preg_replace('/[^\pL\pM]*/iu', '', 'áéíóú');
echo preg_replace('/[^\pL\pM\pN]*/iu', '', 'áéíóú');
 [2011-01-23 18:10 UTC] tino dot didriksen at gmail dot com
...and then I forget to change the *. Let's try that again...

These work as expected:
echo preg_replace('/[^\pL\pM]+/iu', '', 'áéíóú');
echo preg_replace('/[^\pL\pM\pN]+/iu', '', 'áéíóú');
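
A minimal sketch for checking the `+` workaround on a given build. The subject is written as explicit UTF-8 bytes so the result does not depend on the encoding of the script file, and the `preg_match('//u', ...)` call is used here only as a UTF-8 validity check; both choices are assumptions of this example, not part of the original report.

```php
<?php
// "áéíóú" as explicit UTF-8 bytes (two bytes per character).
$subject = "\xC3\xA1\xC3\xA9\xC3\xAD\xC3\xB3\xC3\xBA";

// With + the pattern needs at least one non-letter, non-mark character,
// so it never produces an empty match on this subject.
$result = preg_replace('/[^\pL\pM]+/iu', '', $subject);

// The subject contains only letters, so nothing should be removed.
var_dump($result === $subject);

// An empty pattern with the u modifier matches only if the subject is valid UTF-8.
var_dump(preg_match('//u', $result) === 1);
```

If either check reports false, the output has been altered or is no longer valid UTF-8.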
 [2011-01-23 22:51 UTC] felipe@php.net
-Package: Unicode Engine related +Package: PCRE related
 [2011-01-26 08:02 UTC] aharvey@php.net
-Status: Open +Status: Verified
 [2011-01-26 08:02 UTC] aharvey@php.net
Verified on 5.3 and trunk.
 [2012-02-24 23:33 UTC] robertbasic dot com at gmail dot com
I tried my best on this one. Tested against the trunk:
svn info | grep Revision
Revision: 323476

I created a test file for this, will attach.

I ran the following with gdb:

$ gdb sapi/cgi/php-cgi

and then set a breakpoint

(gdb) break php_pcre.c:1318

finally ran the test script like:

(gdb) run run-tests.php ext/pcre/tests/bug53823.phpt

I copied some of the gdb output to https://gist.github.com/1904467, though it may be incorrect, as I'm fairly new to all this. Lines 12 and 22 in that gist caught my attention.

Also, I think the same issue exists for preg_filter, too.
 [2012-02-25 09:54 UTC] robertbasic dot com at gmail dot com
Updated the test case showing that preg_filter and preg_replace_callback are affected, too.
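
As a rough illustration (this is not the attached bug53823.phpt), the three functions can be run over the same subject and pattern, with the raw bytes of each result printed. The `$results` array and the byte-escaped subject are assumptions of this sketch. On a build that contains the fix, each result should be identical to the subject.

```php
<?php
// "áéíóú" as explicit UTF-8 bytes, so the script file's encoding does not matter.
$subject = "\xC3\xA1\xC3\xA9\xC3\xAD\xC3\xB3\xC3\xBA";
$pattern = '/[^\pL\pM]*/iu';

$results = [
    'preg_replace'          => preg_replace($pattern, '', $subject),
    'preg_filter'           => preg_filter($pattern, '', $subject),
    'preg_replace_callback' => preg_replace_callback(
        $pattern,
        function ($m) { return ''; }, // replace every match with the empty string
        $subject
    ),
];

foreach ($results as $name => $result) {
    // bin2hex exposes the raw bytes, so a dropped continuation byte is visible
    // even when a terminal renders invalid sequences as a replacement character.
    $status = ($result === $subject) ? 'intact' : 'changed';
    printf("%-22s %s %s\n", $name, bin2hex((string) $result), $status);
}
```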
 [2014-12-16 11:35 UTC] nhahtdh at gmail dot com
This should be a duplicate of https://bugs.php.net/bug.php?id=66121, since the underlying cause is the same.

After matching the empty string at the beginning (index 0) and replacing it with the empty string, the function tries to match at index 0 again, this time passing a flag that requires a non-empty match, which obviously fails. The function then advances the offset by 1 data unit (1 byte in this case), which lands in the middle of a multi-byte character, and hilarity ensues.

The correct behavior is that when the `u` modifier is used, the function should always advance by one code point (1 UTF-8 character).
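
The difference between the two advance strategies can be sketched in plain PHP. This is an illustration only, not the code in php_pcre.c; the helper names `utf8_sequence_length` and `advance_offsets` are hypothetical. It lists the offsets at which a retry would start when stepping by one byte versus stepping by one UTF-8 character.

```php
<?php
// Hypothetical helper: byte length of the UTF-8 sequence whose lead byte is at $offset.
function utf8_sequence_length(string $s, int $offset): int
{
    $byte = ord($s[$offset]);
    if ($byte < 0x80) { return 1; }            // ASCII
    if (($byte & 0xE0) === 0xC0) { return 2; } // 110xxxxx
    if (($byte & 0xF0) === 0xE0) { return 3; } // 1110xxxx
    if (($byte & 0xF8) === 0xF0) { return 4; } // 11110xxx
    return 1; // not a lead byte: fall back to a single byte
}

// Hypothetical helper: offsets visited when advancing after each failed retry.
function advance_offsets(string $s, bool $byCodePoint): array
{
    $offsets = [];
    $length = strlen($s);
    for ($offset = 0; $offset < $length; ) {
        $offsets[] = $offset;
        $offset += $byCodePoint ? utf8_sequence_length($s, $offset) : 1;
    }
    return $offsets;
}

// "áéíóú" as explicit UTF-8 bytes: five characters, ten bytes.
$subject = "\xC3\xA1\xC3\xA9\xC3\xAD\xC3\xB3\xC3\xBA";

// Byte-wise stepping visits odd offsets, which sit on continuation bytes.
echo implode(',', advance_offsets($subject, false)), "\n";
// Character-wise stepping visits only the start of each character.
echo implode(',', advance_offsets($subject, true)), "\n";
```

Every odd offset in the byte-wise list points into the second byte of a two-byte character, which is where a match attempt can split a character apart.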
 [2015-06-05 01:15 UTC] cmb@php.net
-Status: Verified +Status: Analyzed
-Operating System: Linux +Operating System: *
-PHP Version: 5.3SVN-2011-01-23 (snap) +PHP Version: 5.6.9
-Assigned To: +Assigned To: cmb
 [2015-06-05 01:15 UTC] cmb@php.net
Confirmed: <http://3v4l.org/ueAJv>.

> The correct behavior is that when `u` modifier is used, the
> function should always advance by code point (1 UTF character).

Exactly. Thanks for analyzing the issue. :)
 [2015-06-05 12:03 UTC] cmb@php.net
Related to bug #27103.
 [2015-06-05 13:13 UTC] cmb@php.net
-Assigned To: cmb +Assigned To:
 [2015-06-05 13:13 UTC] cmb@php.net
I've added a PR that also solves #66121.
 [2015-06-23 15:17 UTC] cmb@php.net
-Assigned To: +Assigned To: cmb
 [2015-06-23 17:44 UTC] cmb@php.net
Automatic comment on behalf of cmbecker69@gmx.de
Revision: http://git.php.net/?p=php-src.git;a=commit;h=23e25f3319db021298310fb97cf537bcef4095ad
Log: Fixed Bug #53823 (preg_replace: * qualifier on unicode replace garbles the string)
 [2015-06-23 17:44 UTC] cmb@php.net
-Status: Analyzed +Status: Closed
 