php.net |  support |  documentation |  report a bug |  advanced search |  search howto |  statistics |  random bug |  login
Bug #70232 Incorrect bump-along behavior with \K and empty string match
Submitted: 2015-08-11 04:57 UTC Modified: 2015-08-13 11:29 UTC
Votes:1
Avg. Score:4.0 ± 0.0
Reproduced:1 of 1 (100.0%)
Same Version:1 (100.0%)
Same OS:0 (0.0%)
From: nhahtdh at gmail dot com Assigned: cmb (profile)
Status: Closed Package: PCRE related
PHP Version: 5.6.12 OS:
Private report: No CVE-ID: None
 [2015-08-11 04:57 UTC] nhahtdh at gmail dot com
Description:
------------
Please see the test case, the expected result and the actual result in the bug report below.

From analyzing the code, I have identified that the error comes from the following line of code (plus similar code in other functions): https://github.com/php/php-src/blob/0787cd60ed3d0c8c8c8ff7e49b9bb3587bf33b64/ext/pcre/php_pcre.c#L891

	g_notempty = (offsets[1] == offsets[0])? PCRE_NOTEMPTY | PCRE_ANCHORED : 0;

	/* Advance to the position right after the last full match */
	start_offset = offsets[1];

This part of the code checks whether the match is empty. If the match it empty, it tries to perform another match from the last position of the match with PCRE_NOTEMPTY flag to force the match to be non-empty.

However, it's incorrect to set the PCRE_NOTEMPTY flag by checking the length of the match based on the start and end indices of the match. For the regex in the test case ~(?: |\G)\d\B\K~ (which always results in empty string matches due to \K), the index where we execute the match from (start_offset) and the start index of the match (offsets[0]) are always different. Therefore, the regex actually advances ahead in the string, and setting PCRE_NOTEMPTY in such case rejects valid empty string matches.

I believe (offsets[1] == offsets[0]) should be changed to (offsets[1] == start_offset). If start_offset <= offsets[0] <= offsets[1] is an invariant, then the change should be correct.

Original report and analysis: http://chat.stackoverflow.com/transcript/message/24995536#24995536

Test script:
---------------
<?
$str = "123 a123 1234567 b123 123";
$str = preg_replace('~(?: |\G)\d\B\K~', "*", $str);
echo $str;
?>

Expected result:
----------------
1*2*3 a123 1*2*3*4*5*6*7 b123 1*2*3

The expected result (10 matches/replacements) can be observed on

- pcretest with PCRE version 8.35 2014-04-04
- regex101 https://regex101.com/r/oP9mZ7/3

Actual result:
--------------
1*23 a123 1*23*45*67 b123 1*23

Patches

Pull Requests

History

AllCommentsChangesGit/SVN commitsRelated reports
 [2015-08-11 12:40 UTC] cmb@php.net
-Status: Open +Status: Verified
 [2015-08-11 12:40 UTC] cmb@php.net
I can confirm the issue, and that this case would be solved by
your suggestion (checking for offsets[1] == start_offset).
However, that would break some other cases, e.g.

    preg_replace('/\b/', '*', '(#11/19/2002#)')
 
Expected:
    string(20) "(#*11*/*19*/*2002*#)"

Actual:
    string(24) "(#**11**/*19**/*2002**#)"
 [2015-08-11 14:49 UTC] nhahtdh at gmail dot com
Oh, you are right. I forgot that the exec function bumps the match along automatically, so relying on start_offset is a bad idea.

Looking at the source code of pcretest, it seems that it uses a different option. Can you check whether the option is usable?

http://vcs.pcre.org/pcre/code/branches/pcre16/pcretest.c?view=markup&pathrev=829#l4233

4233 	      if (use_offsets[0] == use_offsets[1])
4234 	        {
4235 	        if (use_offsets[0] == len) break;
4236 	        g_notempty = PCRE_NOTEMPTY_ATSTART | PCRE_ANCHORED;
4237 	        }
 [2015-08-11 16:33 UTC] cmb@php.net
-Status: Verified +Status: Analyzed -Assigned To: +Assigned To: cmb
 [2015-08-11 16:33 UTC] cmb@php.net
Great, thanks! PCRE_NOTEMPTY_ATSTART is exactly what is needed[1].
It's available as of PCRE 8.00 (2009-10-19)[2], so maybe we need
to have a workaround for older libpcre, even though this bug would
not be fixed for such versions.

[1] <https://lists.exim.org/lurker/message/20090911.102109.6e80cce4.pt-BR.html>
[2] <http://www.pcre.org/original/changelog.txt>
 [2015-08-12 08:42 UTC] nhahtdh at gmail dot com
For old version, I guess the bug can be fixed by using PCRE_ANCHORED all time, so that we take full control of the bump-along. We then can rely on (offsets[1] == start_offset) to check that the match has advanced.

As for work-around, I'm not sure whether a general work-around exists. It should probably be handled on case-by-case basis.
 [2015-08-13 11:29 UTC] cmb@php.net
> For old version, I guess the bug can be fixed by using
> PCRE_ANCHORED all time, so that we take full control of the
> bump-along. We then can rely on (offsets[1] == start_offset) to
> check that the match has advanced.

I gave that a quick try, and got 35 failing tests. I have not
investigated further, as it seems to me that PCRE_NOTEMPTY_ATSTART
would not have been introduced, if a general fix would have been
so simple.

I still prefer to use PCRE_NOTEMPTY_ATSTART instead of
PCRE_NOTEMPTY, if it's available, and to simply fall back to
PCRE_NOTEMPTY otherwise. This will leave this bug unresolved for
PCRE >= 7.2 and < 8.0, but any user who is bitten by this may
update their libpcre.

After some internal discussion[1] it turned out that lifting the
requirements to libpcre >= 8.0 is not an option for now.

[1] <http://markmail.org/thread/go2fprpyzq44invj>
 [2015-08-13 12:30 UTC] cmb@php.net
Automatic comment on behalf of cmb
Revision: http://git.php.net/?p=php-src.git;a=commit;h=b9f23c2152eb635082e43e62a5c395b16f40054e
Log: Fix #70232: Incorrect bump-along behavior with \K and empty string match
 [2015-08-13 12:30 UTC] cmb@php.net
-Status: Analyzed +Status: Closed
 [2015-08-18 16:24 UTC] ab@php.net
Automatic comment on behalf of cmb
Revision: http://git.php.net/?p=php-src.git;a=commit;h=b9f23c2152eb635082e43e62a5c395b16f40054e
Log: Fix #70232: Incorrect bump-along behavior with \K and empty string match
 
PHP Copyright © 2001-2024 The PHP Group
All rights reserved.
Last updated: Thu Nov 21 08:01:29 2024 UTC