php.net |  support |  documentation |  report a bug |  advanced search |  search howto |  statistics |  random bug |  login
Request #77521 Validate regexes
Submitted: 2019-01-25 15:53 UTC Modified: 2019-01-25 19:42 UTC
Votes:1
Avg. Score:3.0 ± 0.0
Reproduced:0 of 0 (0.0%)
From: flip101 at gmail dot com Assigned:
Status: Open Package: PCRE related
PHP Version: 7.3.1 OS: Ubuntu 18.04
Private report: No CVE-ID: None
Have you experienced this issue?
Rate the importance of this bug to you:

 [2019-01-25 15:53 UTC] flip101 at gmail dot com
Description:
------------
At the moment there is poor support to validate a regex. Since the inception of PHP where it was mainly used for creating personal home pages it's been used in more domains. Therefor having the ability to validate regexes would be a useful thing.

As far as i know regexes can be validated to the PCRE engine in the following way:
1. make a custom error handler for PHP warning
2. register customer error handler
3. call preg_match with invalid regex
4. unregister custom error handler.
5. throw exception (or another way of dealing with the error)

The motivation that led me here is the following library and issue:
https://hoa-project.net/En/Literature/Hack/Compiler.html#PP_language
https://github.com/hoaproject/Compiler/issues/15

But i believe there will be other projects now and in the feature that would benefit from a change in PHP PCRE module.

I propose the following changes that should be backwards compatible:
1. Add constant PREG_INVALID_PATTERN (representing int 7) to https://secure.php.net/manual/en/function.preg-last-error.php which gets set when a pattern is an invalid regex.
2. Add a  bool preg_validate_pattern($string)  function that only calls the PCRE2 compile function and checks for succesful compilation. Skipping the step of actually trying to match a subject which is usually done.

Test script:
---------------
<?php

$invalid_pattern = '/(\d+/';
$dummy_subject = '';

preg_match($invalid_pattern, $dummy_subject);


Patches

Add a Patch

Pull Requests

Add a Pull Request

History

AllCommentsChangesGit/SVN commitsRelated reports
 [2019-01-25 16:17 UTC] danack@php.net
I believe using preg_last_error() and silencing the preg_match error gives a sane way to validate if a regex is acceptable to PCRE.


$invalid_pattern = '/(\d+/';
$dummy_subject = '';

@preg_match($invalid_pattern, $dummy_subject);

$lastError = preg_last_error();

if ($lastError) {
    echo "something was wrong with the regex";
}


Does that not cover detecting valid regexes?

Admittedly getting the exact position of the error would be more than a little useful...
 [2019-01-25 16:49 UTC] flip101 at gmail dot com
Hi danack. Before when i tested this my PHP shell (psysh) was not showing the PREG_INTERNAL_ERROR value. When i put your code in a file i see this error too. That's great now i don't need a custom error handler. However i still like a separate function for validating the regex. Unfortunately adding a PREG_INVALID_PATTERN won't be backwards compatible now because people might already be checking for $preg === 1 :(
I'm not sure if PREG_INTERNAL_ERROR can happen in other ways that having an invalid pattern. It would be good to make the invalid pattern case explicit.
 [2019-01-25 17:02 UTC] flip101 at gmail dot com
By the way there is also the return value of preg_match set to false when the pattern match fails. But both the preg_last_error function and the return value can not differentiate between an invalid pattern and another general error.

As you point out danack it would be good to have better information about the error. Perhaps preg_validate_pattern could return an array like:

$return = array('message' => 'missing closing parenthesis', 'offset' => 4);
 [2019-01-25 19:42 UTC] requinix@php.net
I don't understand the "PHP is popular therefore it needs a regex validator function" argument...

I would think that a deliberate need to validate a regex would be uncommon outside of an actual regex testing/validation library. Surely most of the time it would be developer error? Writing a bad regex, or incorporating an unknown value that should have been preg_quote()d. Normal error reporting situations address those.

An offset would be nice for a bad regex, but it would also help with PREG_BAD_UTF8_ERROR. And maybe others. What if preg_last_error could be extended with additional information about the error? A by-ref argument for additional information, like an array to be generic or just an int for an offset (pattern or data, depending on the error).
 [2019-02-18 03:27 UTC] tandre@php.net
This is more of an inconvenience than a major issue in practice - this can be abstracted into a composer library (which may already exist, I haven't checked).

- I personally want this feature
- The amount of setup and teardown involved is inefficient, and reinventing regex validation is error prone and misses edge cases

I slightly prefer preg_replace over preg_match, because preg_replace will warn about the `/e` modifier being removed and doing nothing, while preg_match doesn't.

                $result = @\preg_replace($pattern, '', '');
                if ($result === false || $result === null) {
                    return \error_get_last() ?? [];
                }
                return null;

I ran into similar issues writing a static analysis plugin that would warn about invalid PCRE regexes passed to preg_match, etc. The code used is 
https://github.com/phan/phan/blob/1.2.3/.phan/plugins/PregRegexCheckerPlugin.php#L41-L58
 
PHP Copyright © 2001-2019 The PHP Group
All rights reserved.
Last updated: Thu Apr 25 13:01:25 2019 UTC