Bug #74132 preg_match_all hitting backtrack limit in master and not in released versions
Submitted: 2017-02-20 04:14 UTC Modified: 2018-03-21 13:43 UTC
From: liyan at bianhua8 dot com Assigned: cmb (profile)
Status: Closed Package: PCRE related
PHP Version: 7.1Git-2017-02-20 (Git) OS: ubuntu
Private report: No CVE-ID: None
 [2017-02-20 04:14 UTC] liyan at bianhua8 dot com
Description:
------------
After calling preg_match_all(), the third parameter, $matches, contains an array of matches,
but the return value is false.


Test script:
---------------
<?php
$content = '<th class="title">
    <div class="subject">
        <a href="158391487256182472-1-1.html" title="Jomashop:Swarovski"></a>
    </div>
</th>
<td class="author">
    <a href="profile-42357964-1.html" title="author"><span>cat</span></a>
    <span>2017-02-16</span>
</td>
<div class="link0 list-side-hd moderator-noborder obj2subject"><h3 class="fl"></h3></div><ul class="list-side-bd list-recommend list-recommend2 link1"><li><a target="_blank" href="//go.cqmmgo.com/forum-462505-thread-94061456905358538-1-1.html"><img width="120" height="160" title="#" alt="#" src="//att3.citysbs.com/120x120/chongqing/2016/03/09/09/160x120-092826_v2_11861457486906377_58b326482e2aa1f1da0f8455ac42187f.jpg"></a></li><li><a target="_blank" href="//go.cqmmgo.com/forum-462505-thread-179021457000027333-1-1.html"><img width="120" height="160" title="#" alt="#" src="//att3.citysbs.com/120x120/chongqing/2016/03/09/09/160x120-092826_v2_20771457486906773_31eb80d49e9ef06418003e577cc1453f.jpg"></a></li><li><a target="_blank" href="//go.cqmmgo.com/forum-462505-thread-93091456983256969-1-1.html"><img width="120" height="160" title="#" alt="#" src="//att3.citysbs.com/120x120/chongqing/2016/03/09/09/160x120-092827_v2_20231457486907098_05bc3d4608c961af0660722e03049e20.jpg"><span>abc</span></a></li><li><a target="_blank" href="//go.cqmmgo.com/forum-462505-thread-13801456844284137-1-1.html"><img width="120" height="160" title="#" alt="#" src="//att3.citysbs.com/120x120/chongqing/2016/03/09/09/160x120-092827_v2_14121457486907436_d49c002ba8675553c5436e80f305aca7.jpg"><span>xxxxxxxxxxxxxxxxxxxxxxxxxxxxx</span></a></li></ul></div>';
$ret = preg_match_all('/subject[\s\S]+?href="(.+?)"[\s\S]+?title="([\s\S]+?)"[\s\S]+?profile/', $content, $matches);

var_dump($matches); // contains the matched results
var_dump($ret); // bug: the return value is false


History

 [2017-02-20 08:18 UTC] requinix@php.net
-Status: Open +Status: Feedback
 [2017-02-20 08:18 UTC] requinix@php.net
Seems to be working fine. https://3v4l.org/ApKvQ
 [2017-02-20 09:14 UTC] liyan at bianhua8 dot com
-Status: Feedback +Status: Open
 [2017-02-20 09:14 UTC] liyan at bianhua8 dot com
Thanks.
Copy the script below to https://3v4l.org/ApKvQ without reformatting it, and the bug can be reproduced.

The output:
array(3) {
  [0]=>
  array(1) {
    [0]=>
    string(147) "subject">
        <a href="158391487256182472-1-1.html" title="Jomashop:Swarovski"></a>
    </div>
</th>
<td class="author">
    <a href="profile"
  }
  [1]=>
  array(1) {
    [0]=>
    string(27) "158391487256182472-1-1.html"
  }
  [2]=>
  array(1) {
    [0]=>
    string(20) "Jomashop:Swarovski"
  }
}
bool(false)
 [2017-02-20 09:59 UTC] requinix@php.net
-Status: Open +Status: Feedback
 [2017-02-20 09:59 UTC] requinix@php.net
Thank you for this bug report. To properly diagnose the problem, we
need a short but complete example script to be able to reproduce
this bug ourselves. 

A proper reproducing script starts with <?php and ends with ?>,
is max. 10-20 lines long and does not require any external 
resources such as databases, etc. If the script requires a 
database to demonstrate the issue, please make sure it creates 
all necessary tables, stored procedures etc.

Please avoid embedding huge scripts into the report.

Copy what?
 [2017-02-20 12:02 UTC] liyan at bianhua8 dot com
-Status: Feedback +Status: Open
 [2017-02-20 12:02 UTC] liyan at bianhua8 dot com
The following is the test script, the same as in my first post.

<?php
$content = '<th class="title">
    <div class="subject">
        <a href="158391487256182472-1-1.html" title="Jomashop:Swarovski"></a>
    </div>
</th>
<td class="author">
    <a href="profile-42357964-1.html" title="author"><span>cat</span></a>
    <span>2017-02-16</span>
</td>
<div class="link0 list-side-hd moderator-noborder obj2subject"><h3 class="fl"></h3></div><ul class="list-side-bd list-recommend list-recommend2 link1"><li><a target="_blank" href="//go.cqmmgo.com/forum-462505-thread-94061456905358538-1-1.html"><img width="120" height="160" title="#" alt="#" src="//att3.citysbs.com/120x120/chongqing/2016/03/09/09/160x120-092826_v2_11861457486906377_58b326482e2aa1f1da0f8455ac42187f.jpg"></a></li><li><a target="_blank" href="//go.cqmmgo.com/forum-462505-thread-179021457000027333-1-1.html"><img width="120" height="160" title="#" alt="#" src="//att3.citysbs.com/120x120/chongqing/2016/03/09/09/160x120-092826_v2_20771457486906773_31eb80d49e9ef06418003e577cc1453f.jpg"></a></li><li><a target="_blank" href="//go.cqmmgo.com/forum-462505-thread-93091456983256969-1-1.html"><img width="120" height="160" title="#" alt="#" src="//att3.citysbs.com/120x120/chongqing/2016/03/09/09/160x120-092827_v2_20231457486907098_05bc3d4608c961af0660722e03049e20.jpg"><span>abc</span></a></li><li><a target="_blank" href="//go.cqmmgo.com/forum-462505-thread-13801456844284137-1-1.html"><img width="120" height="160" title="#" alt="#" src="//att3.citysbs.com/120x120/chongqing/2016/03/09/09/160x120-092827_v2_14121457486907436_d49c002ba8675553c5436e80f305aca7.jpg"><span>xxxxxxxxxxxxxxxxxxxxxxxxxxxxx</span></a></li></ul></div>';
$ret = preg_match_all('/subject[\s\S]+?href="(.+?)"[\s\S]+?title="([\s\S]+?)"[\s\S]+?profile/', $content, $matches);

var_dump($matches); // contains the matched results
var_dump($ret); // bug: the return value is false
?>
 [2017-02-20 13:31 UTC] requinix@php.net
-Summary: preg_match_all return value error. +Summary: preg_match_all hitting backtrack limit in master and not in released versions
 [2017-02-20 13:31 UTC] requinix@php.net
It looks like libpcre in master is hitting the backtrack limit. preg_match_all() returning false means there was an error during matching, but it can still "return" any matches that were found. JIT on or off doesn't seem to make a difference here.

I don't know why master cannot complete the matching; libpcre hasn't been upgraded in about a year...
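
To make the distinction between "no matches" (int 0) and "error during matching" (false) visible in code, here is a minimal sketch. The helper name match_all_or_fail() is hypothetical and not part of PHP or of this report; it only wraps the real preg_match_all() and preg_last_error() calls.

```php
<?php
// Hypothetical helper for illustration; it is not part of PHP or of this report.
function match_all_or_fail(string $pattern, string $subject): array
{
    $matches = [];
    $count = preg_match_all($pattern, $subject, $matches);

    // false signals an error during matching; "zero matches" would be int 0.
    if ($count === false) {
        $code = preg_last_error();
        throw new RuntimeException(
            $code === PREG_BACKTRACK_LIMIT_ERROR
                ? 'pcre.backtrack_limit exhausted'
                : 'PCRE error code ' . $code
        );
    }

    return $matches;
}

// A small, well-behaved pattern: both href values are captured.
$matches = match_all_or_fail('/href="(.+?)"/', '<a href="a.html"></a><a href="b.html"></a>');
var_dump($matches[1]);
```

Because $matches may already hold partial results when false is returned, checking the return value with === false is the only reliable way to notice the failure.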


Solution:
Besides increasing the backtrack limit, you can make a simple change to your regex. The problem is that PCRE will try a second match starting at 
  <div class="link0 list-side-hd moderator-noborder obj2subject">
                                                        ^
and reach the backtrack limit before it fails to match.

If I modify the regex as
  /"subject"[\s\S]+?...
then it matches correctly and returns successfully for me.
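
A sketch of the tightened pattern, run against a shortened stand-in for the reporter's markup (the $content below is an assumption made to keep the example small, not the original input). The leading quote prevents a second attempt from starting inside obj2subject".

```php
<?php
// Shortened, hypothetical stand-in for the markup in the report.
$content = '<div class="subject">
    <a href="158391487256182472-1-1.html" title="Jomashop:Swarovski"></a>
</div>
<a href="profile-42357964-1.html" title="author"></a>
<div class="link0 obj2subject"></div>';

// Same regex as in the report, but anchored on the quoted attribute value.
$pattern = '/"subject"[\s\S]+?href="(.+?)"[\s\S]+?title="([\s\S]+?)"[\s\S]+?profile/';

$ret = preg_match_all($pattern, $content, $matches);

var_dump($ret);            // number of full matches, or false on error
var_dump($matches[1][0]);  // captured href
var_dump($matches[2][0]);  // captured title
```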


Side comment: don't use regular expressions to parse HTML. Use DOM.
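
As a rough illustration of the DOM approach, the following sketch uses DOMDocument and DOMXPath on a simplified fragment. The markup and the XPath expression are assumptions for the example (div elements replace the th/td cells of the report), and the dom extension must be available.

```php
<?php
// Simplified, hypothetical fragment modelled on the markup in the report.
$html = '<div class="subject"><a href="158391487256182472-1-1.html" title="Jomashop:Swarovski"></a></div>'
      . '<div class="author"><a href="profile-42357964-1.html" title="author"><span>cat</span></a></div>';

// Collect parser warnings internally instead of emitting them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Select links that are direct children of the "subject" block.
$xpath = new DOMXPath($doc);
$results = [];
foreach ($xpath->query('//div[@class="subject"]/a') as $link) {
    $results[] = [$link->getAttribute('href'), $link->getAttribute('title')];
}

var_dump($results);
```

No backtracking is involved here, so the size of the rest of the page does not affect whether the lookup succeeds.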
 [2017-02-20 13:35 UTC] andrew dot nester dot dev at gmail dot com
As far as I can see from this code, you really are getting an error: PHP_PCRE_BACKTRACK_LIMIT_ERROR (exposed to scripts as PREG_BACKTRACK_LIMIT_ERROR).
You can set a higher backtrack limit like this and you'll be fine (tested on my end):
ini_set("pcre.backtrack_limit", "10000000");
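
If raising the limit globally is not desirable, one option is to raise it only around the expensive call and restore it afterwards. This is a sketch under that assumption; with_backtrack_limit() is a hypothetical helper, not a PHP function.

```php
<?php
// Hypothetical helper: temporarily raises pcre.backtrack_limit for one callback.
function with_backtrack_limit(int $limit, callable $fn)
{
    // ini_set() returns the old value as a string, or false on failure.
    $previous = ini_set('pcre.backtrack_limit', (string) $limit);
    try {
        return $fn();
    } finally {
        // Restore the earlier setting even if the callback throws.
        if ($previous !== false) {
            ini_set('pcre.backtrack_limit', $previous);
        }
    }
}

$before = ini_get('pcre.backtrack_limit');
$inside = with_backtrack_limit(10000000, function () {
    // Any preg_* call made here runs with the raised limit.
    return ini_get('pcre.backtrack_limit');
});
$after = ini_get('pcre.backtrack_limit');

var_dump($inside);
```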
 [2017-02-21 06:05 UTC] liyan at bianhua8 dot com
-Status: Open +Status: Closed
 [2017-02-21 06:05 UTC] liyan at bianhua8 dot com
I see. Thanks a lot, requinix and andrew :)
 [2017-02-21 06:19 UTC] requinix@php.net
-Status: Closed +Status: Re-Opened
 [2017-02-21 06:19 UTC] requinix@php.net
I'm not convinced this isn't a bug. Certainly it might not be, but something changed and I'm not sure it wasn't upstream in libpcre or that it was intentional.
 [2018-03-21 13:43 UTC] cmb@php.net
-Status: Re-Opened +Status: Closed -Assigned To: +Assigned To: cmb
 [2018-03-21 13:43 UTC] cmb@php.net
<https://3v4l.org/ApKvQ> works fine, so I assume this has been a
temporary issue.
 
Last updated: Wed Dec 11 02:01:23 2019 UTC