Bug #49568 Regex does not match when text added to matching text
Submitted: 2009-09-16 01:39 UTC Modified: 2009-09-18 19:16 UTC
From: anoop dot john at zyxware dot com Assigned:
Status: Not a bug Package: PCRE related
PHP Version: 5.2.10 OS: Ubuntu Jaunty
Private report: No CVE-ID: None
 [2009-09-16 01:39 UTC] anoop dot john at zyxware dot com
Description:
------------
I am using a complex regex pattern to match stock tickers in a piece of text. The following pattern

$pattern = '/\(((?i:\s*[a-z]*\s*[a-z]*\s*,)*\s*(?i:AMEX|NASDAQ|NasdaqGM|NasdaqGS|NYSE)\s*(?i:,\s*[a-z]*\s*[a-z]*\s*)*):\s*([A-Z]+)\s*;((?i:\s*[a-z]*\s*[a-z]*\s*,)*\s*(?i:AMEX|NASDAQ|NasdaqGM|NasdaqGS|NYSE)\s*(?i:,\s*[a-z]*\s*[a-z]*\s*)*):\s*([A-Z]+)\s*\)/';

should match 

(AMEX,NYSE, Swiss Exchange: CRX;Nasdaq: QTWW) 

and it does match when that string is the entire subject. However, when a particular string that does not match the pattern is prepended to the subject, the regex no longer matches the original portion of the string. The culprit string is given below.

(Euronext, NASDAQ: CRXL; AMEX,NYSE,NASDAQ, Swiss Exchange: CRX;NasdaqGM: QTWW)

The pattern matches exactly one opening parenthesis and cannot match a second one. So the failure cannot be explained by the pattern consuming the first parenthesized group and running on into the second group when the culprit string is prepended.


Reproduce code:
---------------
$pattern = '/\(((?i:\s*[a-z]*\s*[a-z]*\s*,)*\s*(?i:AMEX|NASDAQ|NasdaqGM|NasdaqGS|NYSE)\s*(?i:,\s*[a-z]*\s*[a-z]*\s*)*):\s*([A-Z]+)\s*;((?i:\s*[a-z]*\s*[a-z]*\s*,)*\s*(?i:AMEX|NASDAQ|NasdaqGM|NasdaqGS|NYSE)\s*(?i:,\s*[a-z]*\s*[a-z]*\s*)*):\s*([A-Z]+)\s*\)/';
preg_match_all($pattern, '(Euronext, NASDAQ: CRXL; AMEX,NYSE,NASDAQ, Swiss Exchange: CRX;NasdaqGM: QTWW) (AMEX,NYSE, Swiss Exchange: CRX;Nasdaq: QTWW)', $matches, PREG_SET_ORDER);
var_export($matches);
echo "<br /><br />";
preg_match_all($pattern, '(AMEX,NYSE, Swiss Exchange: CRX;Nasdaq: QTWW)', $matches, PREG_SET_ORDER);
var_export($matches);


Expected result:
----------------
array ( 0 => array ( 0 => '(AMEX,NYSE, Swiss Exchange: CRX;Nasdaq: QTWW)', 1 => 'AMEX,NYSE, Swiss Exchange', 2 => 'CRX', 3 => 'Nasdaq', 4 => 'QTWW', ), )

array ( 0 => array ( 0 => '(AMEX,NYSE, Swiss Exchange: CRX;Nasdaq: QTWW)', 1 => 'AMEX,NYSE, Swiss Exchange', 2 => 'CRX', 3 => 'Nasdaq', 4 => 'QTWW', ), )

Actual result:
--------------
array ( )

array ( 0 => array ( 0 => '(AMEX,NYSE, Swiss Exchange: CRX;Nasdaq: QTWW)', 1 => 'AMEX,NYSE, Swiss Exchange', 2 => 'CRX', 3 => 'Nasdaq', 4 => 'QTWW', ), )
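An empty result like the first one above does not by itself show whether the engine found no match or stopped early with an error. The sketch below is one way to check, using preg_last_error() right after the call. The helper name describePregError is hypothetical and not part of the report, and nothing in this thread confirms that an engine limit is what happened here; the printed status depends on the PHP build and its pcre.* settings.

```php
<?php
// Hypothetical helper (not from the report): turn a preg_last_error() code
// into a readable label.
function describePregError($code)
{
    $names = array(
        PREG_NO_ERROR              => 'no error',
        PREG_INTERNAL_ERROR        => 'internal error',
        PREG_BACKTRACK_LIMIT_ERROR => 'backtrack limit exhausted',
        PREG_RECURSION_LIMIT_ERROR => 'recursion limit exhausted',
        PREG_BAD_UTF8_ERROR        => 'malformed UTF-8',
    );
    return isset($names[$code]) ? $names[$code] : 'other error (' . $code . ')';
}

// Pattern and subject copied from the reproduce code above.
$pattern = '/\(((?i:\s*[a-z]*\s*[a-z]*\s*,)*\s*(?i:AMEX|NASDAQ|NasdaqGM|NasdaqGS|NYSE)\s*(?i:,\s*[a-z]*\s*[a-z]*\s*)*):\s*([A-Z]+)\s*;((?i:\s*[a-z]*\s*[a-z]*\s*,)*\s*(?i:AMEX|NASDAQ|NasdaqGM|NasdaqGS|NYSE)\s*(?i:,\s*[a-z]*\s*[a-z]*\s*)*):\s*([A-Z]+)\s*\)/';
$subject = '(Euronext, NASDAQ: CRXL; AMEX,NYSE,NASDAQ, Swiss Exchange: CRX;NasdaqGM: QTWW) (AMEX,NYSE, Swiss Exchange: CRX;Nasdaq: QTWW)';

$count = preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER);

// Report both the return value and the engine status for the call.
echo 'return value: ', var_export($count, true), "\n";
echo 'engine status: ', describePregError(preg_last_error()), "\n";
```

If the status is anything other than 'no error', the empty array reflects an aborted search rather than a genuine non-match.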

 [2009-09-16 12:03 UTC] jani@php.net
And you're 100% sure your pattern is not buggy?
 [2009-09-16 18:16 UTC] anoop dot john at zyxware dot com
I know one thing for sure. The pattern matches only one opening parenthesis and one closing parenthesis. So it cannot start matching in the first parenthesized group and continue into the second one in the example given. When it fails on the first group, matching should restart at the opening parenthesis of the second group.
 [2009-09-18 13:46 UTC] jani@php.net
Please simplify the regex as much as possible. Once you have the simplest case that still shows the problem, we might be able to say whether it's a bug or not.
 [2009-09-18 14:25 UTC] anoop dot john at zyxware dot com
I tried taking conditions out of the regular expression, but as soon as I took out the first condition the expression started giving the expected result. So the symptom appears only for this specific expression and this specific text.

My logic about the issue seems to be OK.

If the pattern

\(P\) matches (A) and returns (A) in the matches array, and

\(P\) does not match (B),

where no part of P can match \( or \), then

\(P\) should definitely match in (B)(A) and return (A) in the matches array
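The reasoning above can be illustrated with a much smaller pattern. This is only a sketch: P is assumed to be [a-z]+, (A) is '(abc)' and (B) is '(123)', none of which come from the report, and it shows the expected behaviour in the simple case rather than explaining the reported one.

```php
<?php
// Assumed stand-in for \(P\): P = [a-z]+, which cannot match ( or ).
$pattern = '/\([a-z]+\)/';

preg_match_all($pattern, '(abc)', $a);        // (A) alone
preg_match_all($pattern, '(123)', $b);        // (B) alone
preg_match_all($pattern, '(123)(abc)', $ba);  // (B) followed by (A)

var_export($a[0]);   // contains '(abc)'
echo "\n";
var_export($b[0]);   // empty array
echo "\n";
var_export($ba[0]);  // contains only '(abc)'
echo "\n";
```

Here the failed attempt on (B) does not prevent the later match on (A), which is the behaviour the argument expects from the full pattern as well.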
 [2009-09-18 17:55 UTC] jani@php.net
How about fixing your pattern to match 1 or more times? Now it only matches if there's exactly one match.
 [2009-09-18 18:13 UTC] anoop dot john at zyxware dot com
Oh no, I don't have a big issue with the bug as far as my application's needs are concerned. The example was only a use case I tried while testing the regex. I reported the bug (if it is indeed one) so that you can fix it (if it is worth fixing) for everybody's sake :-)
 [2009-09-18 18:42 UTC] jani@php.net
Well, it isn't a bug. Your pattern just doesn't work properly. Try adding '?' at the end of it.

See also:

http://php.net/manual/en/regexp.reference.meta.php
 [2009-09-18 19:16 UTC] anoop dot john at zyxware dot com
I am sorry, but adding a ? to the end of the pattern makes the closing parenthesis optional. The regex would then match the content of the first parenthesized group up to the point where it stops matching, and the content of the second group completely, including its closing parenthesis. But the point is not to match the content of the first group at all.

The following are the results of the suggested change. The matches array now contains the partial match from the first parenthesized group as matches[0] and the full match from the second group as matches[1]. This is incorrect. The contents of the first group should not be matched at all.

array ( 0 => array ( 0 => '(Euronext, NASDAQ: CRXL; AMEX,NYSE,NASDAQ,Swiss Exchange: CRX', 1 => 'Euronext, NASDAQ', 2 => 'CRXL', 3 => ' AMEX,NYSE,NASDAQ,Swiss Exchange', 4 => 'CRX', ), 1 => array ( 0 => '(AMEX,NYSE, Swiss Exchange:CRX;Nasdaq: QTWW)', 1 => 'AMEX,NYSE, Swiss Exchange', 2 => 'CRX', 3 => 'Nasdaq', 4 => 'QTWW', ), )

array ( 0 => array ( 0 => '(AMEX,NYSE, Swiss Exchange: CRX;Nasdaq:QTWW)', 1 => 'AMEX,NYSE, Swiss Exchange', 2 => 'CRX', 3 => 'Nasdaq', 4 => 'QTWW', ), )
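The effect of a trailing ? described in this comment can be reproduced on a small pattern. The pattern and subject below are assumptions chosen for illustration and are not taken from the report.

```php
<?php
// Assumed subject: one unclosed group followed by one complete group.
$subject = '(abc (def)';

// Without the trailing ?, the closing parenthesis is required.
preg_match_all('/\([a-z]+\)/', $subject, $strict);

// With the trailing ?, only the closing parenthesis becomes optional,
// so the unclosed group is reported as a partial match.
preg_match_all('/\([a-z]+\)?/', $subject, $loose);

var_export($strict[0]);  // only '(def)'
echo "\n";
var_export($loose[0]);   // '(abc' and '(def)'
echo "\n";
```

The quantifier applies only to the last element of the pattern, which is why the unwanted partial match appears.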

To give you some context, the regex does the following.

Match two sets of 

    combinations of one of the words
      from AMEX|NASDAQ|NasdaqGM|NasdaqGS|NYSE 
    and any number of 
      (words or groups of words separated by spaces)
    separated by commas

    paired with a stock ticker in full caps
      and separated from the exchange name by :

    and both combinations enclosed within one pair of parentheses
      and separated by ;

and remember

    1) Combination of exchange names of first stock
    2) First stock name
    3) Combination of exchange names of second stock
    4) Second stock name
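The structure described above could also be matched in two steps: a simple regex for the outer shape, then a check in PHP that each name list contains a known exchange. This is a different technique from the single pattern in the report, not a fix endorsed anywhere in this thread. The function names findTickerPairs and hasKnownExchange are hypothetical, and the sketch is looser than the original in that a name may contain any characters other than parentheses, colons and semicolons.

```php
<?php
// Hypothetical helper: true if any comma-separated name in $list is one of
// the known exchanges (compared case-insensitively, as (?i:...) did).
function hasKnownExchange($list, $known)
{
    foreach (explode(',', $list) as $name) {
        foreach ($known as $exchange) {
            if (strcasecmp(trim($name), $exchange) === 0) {
                return true;
            }
        }
    }
    return false;
}

// Hypothetical helper: return sets shaped like the report's expected result,
// i.e. full match, first name list, first ticker, second name list, second ticker.
function findTickerPairs($text)
{
    $known = array('AMEX', 'NASDAQ', 'NasdaqGM', 'NasdaqGS', 'NYSE');
    // Outer shape only: (names: TICKER;names: TICKER)
    $shape = '/\(([^():;]+):\s*([A-Z]+)\s*;([^():;]+):\s*([A-Z]+)\s*\)/';
    preg_match_all($shape, $text, $sets, PREG_SET_ORDER);

    $result = array();
    foreach ($sets as $set) {
        if (hasKnownExchange($set[1], $known) && hasKnownExchange($set[3], $known)) {
            $result[] = $set;
        }
    }
    return $result;
}

$subject = '(Euronext, NASDAQ: CRXL; AMEX,NYSE,NASDAQ, Swiss Exchange: CRX;NasdaqGM: QTWW) (AMEX,NYSE, Swiss Exchange: CRX;Nasdaq: QTWW)';
var_export(findTickerPairs($subject));
```

Because each character class excludes the delimiters, the regex has few ways to split the input, and the exchange-name check moves out of the pattern into ordinary code.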
 