Bug #41896 preg_replace crashes with large input
Submitted: 2007-07-04 19:03 UTC Modified: 2007-07-04 19:12 UTC
From: giacomoread at hotmail dot com Assigned:
Status: Not a bug Package: PCRE related
PHP Version: 5.2.3 OS: All
Private report: No CVE-ID: None

 

 [2007-07-04 19:03 UTC] giacomoread at hotmail dot com
Description:
------------
I found a similar bug which was closed with status bogus. Unacceptable! There is nothing in the documentation which states limits to the input of preg_replace or any portable work arounds documented. Stating that 'it is just a stack overflow' just to keep the bug count down is more than a little unprofessional. A scripting language should either make the workaround internal or document input limits NOT cause seg faults. This is a bug whether the php community is willing to accept it or not.

Reproduce code:
---------------
function parse($html, &$title, &$text, &$anchors)
{
  $pstring1 = "'[^']*'";
  $pstring2 = '"[^"]*"';
  $pnstring = "[^'\">]";
  $pintag   = "(?:$pstring1|$pstring2|$pnstring)*";
  $pattrs   = "(?:\\s$pintag){0,1}";

  $pcomment = enclose("<!--", "-", "->");
  $pscript  = enclose("<script$pattrs>", "<", "\\/script>");
  $pstyle   = enclose("<style$pattrs>", "<", "\\/style>");
  $pexclude = "(?:$pcomment|$pscript|$pstyle)";

  $ptitle   = enclose("<title$pattrs>", "<", "\\/title>");
  $panchor  = "<a(?:\\s$pintag){0,1}>";
  $phref    = "href\\s*=[\\s'\"]*([^\\s'\">]*)";

  $html = preg_replace("/$pexclude/iX", " ", $html);

  if ($title !== false)
    $title = preg_match("/$ptitle/iX", $html, $title)
             ? $title[1] : '';

  if ($text !== false)
  {
    $text = preg_replace("/<$pintag>/iX",   " ", $html);
    $text = preg_replace("/\\s+|&nbsp;/iX", " ", $text);
  }

  if ($anchors !== false)
  {
    preg_match_all("/$panchor/iX", $html, $anchors);
    $anchors = $anchors[0];

    // foreach replaces reset()/each(); each() was removed in PHP 8
    foreach ($anchors as $i => $x)
      $anchors[$i] =
        preg_match("/$phref/iX", $x, $m) ? $m[1] : '';

    $anchors = array_unique($anchors);
  }
}

function enclose($start, $end1, $end2)
{
  return "$start((?:[^$end1]|$end1(?!$end2))*)$end1$end2";
}

Expected result:
----------------
The code should split HTML pages into title, text and links. It works fine until large pages are downloaded; then it segfaults, and gdb points to preg_replace as the source of the crash.
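
For readers hitting the same crash, one possible mitigation is to lower pcre.recursion_limit before the call and check preg_last_error() afterwards, so PCRE fails with an error code instead of overflowing the stack. The sketch below assumes PHP 5.2.0 or later, where both exist. The helper name checked_preg_replace and the limit of 10000 are illustrative assumptions, not values from this report; a safe limit depends on the platform's stack size.

```php
<?php
// Hypothetical helper (not from the original report): run preg_replace with
// a conservative recursion limit and turn PCRE failures into exceptions.
function checked_preg_replace($pattern, $replacement, $subject, $recursionLimit = 10000)
{
    // Assumption: 10000 is low enough for the local stack; tune per system.
    $old = ini_set('pcre.recursion_limit', (string) $recursionLimit);

    $result = preg_replace($pattern, $replacement, $subject);
    $error  = preg_last_error();

    // Restore the previous setting so unrelated code is not affected.
    if ($old !== false) {
        ini_set('pcre.recursion_limit', $old);
    }

    // preg_replace returns null when PCRE reports an error.
    if ($result === null) {
        throw new RuntimeException("preg_replace failed, PCRE error code $error");
    }
    return $result;
}

// Usage: collapse whitespace in a small sample string.
$text = checked_preg_replace('/\s+/', ' ', "some \n\t text");
```

A failure can then be handled by catching the exception and, for example, skipping the page, rather than losing the whole process to a segfault.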



History

 [2007-07-04 19:12 UTC] tony2001@php.net
>I found a similar bug which was closed with status bogus. 
Surely it's bogus, since it's not a PHP issue.

>There is nothing in the documentation which states limits to the
>input of preg_replace or any portable work arounds documented. 
Right, we can't and we won't document any bugs in third-party libs.

>Stating that 'it is just a stack overflow' just to keep the bug
>count down is more than a little unprofessional. 
"It's just a stack overflow" that happens outside of PHP and we cannot control it. I guess you failed to read the second part of the sentence.

>A scripting language should either make the workaround internal 
>or document input limits NOT cause seg faults.

We do accept patches both to the source code and to the documentation.

>This is a bug whether the php community is willing to accept it or not.
Yes, it's a known bug in PCRE.
Please report it to PCRE developers.
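
Since the overflow happens inside PCRE, a pattern-level workaround may also help. PCRE's own stack-usage notes suggest that a group repeated once per character, such as ([^<]|<(?!inet))+, uses far less stack when the character class is made possessive, because each run of ordinary characters is then consumed in a single step. The sketch below applies that idea to the reporter's enclose() helper. The name enclose_possessive is hypothetical, and whether this avoids the crash on a given PCRE build is an assumption to verify locally.

```php
<?php
// Hypothetical variant of the report's enclose(): [^$end1]++ swallows a whole
// run of non-delimiter characters per group iteration instead of one character.
function enclose_possessive($start, $end1, $end2)
{
    return "$start((?:[^$end1]++|$end1(?!$end2))*)$end1$end2";
}

// Usage: strip HTML comments, as the original $pcomment pattern does.
$pcomment = enclose_possessive("<!--", "-", "->");
$html     = "before<!-- note - with dashes -->after";
$cleaned  = preg_replace("/$pcomment/i", " ", $html);
```

The pattern accepts the same text as the original; only the way PCRE walks through the input changes.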
 