Bug #72685 No support for PCRE_NO_UTF_CHECK flag
Submitted: 2016-07-27 02:13 UTC Modified: 2019-03-18 12:26 UTC
Avg. Score:2.0 ± 0.0
Reproduced:0 of 1 (0.0%)
From: ju1ius at laposte dot net Assigned: nikic (profile)
Status: Closed Package: PCRE related
PHP Version: Irrelevant OS: Debian Sid
Private report: No CVE-ID: None
 [2016-07-27 02:13 UTC] ju1ius at laposte dot net
When matching in UTF8 mode ('u' flag), the `preg_match` function checks the validity of the entire input string on every call.

PCRE2 has a flag to disable this behavior: `PCRE2_NO_UTF_CHECK`, but PHP does not expose it.

As a result, when the same (potentially long) input string is matched many times, as in lexical analysis, the full validity check is repeated on every call. A scan that should be `O(n)` overall becomes `O(n²)`, which is catastrophic for large inputs.
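
To illustrate where the quadratic term comes from, here is a minimal cost model. It is only a sketch, not PHP's or PCRE's actual implementation: the function `countValidatedBytes()` and its parameters are hypothetical names, and it assumes one token is consumed per call and that each validity check scans the whole subject.

```php
<?php
// Hypothetical cost model (NOT the real PCRE code): count how many bytes
// get scanned for UTF-8 validity while lexing a subject of $inputSize bytes
// one token (one byte) per call.
function countValidatedBytes(int $inputSize, bool $checkEveryCall): int
{
    $validated = 0;
    for ($pos = 0; $pos < $inputSize; ++$pos) {
        // Either every call re-validates the whole subject,
        // or only the very first call does.
        if ($checkEveryCall || $pos === 0) {
            $validated += $inputSize;
        }
    }
    return $validated;
}

foreach ([10000, 100000] as $n) {
    printf(
        "n=%d  check on every call: %d bytes, check once: %d bytes\n",
        $n,
        countValidatedBytes($n, true),
        countValidatedBytes($n, false)
    );
}
```

Under these assumptions the per-call check costs n × n bytes in total, while a single up-front check costs only n bytes.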

Test script:
// see online at
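$input_size = 1e4; // number of one-byte characters in the subject

// Baseline: no 'u' flag, so no UTF-8 validity check is performed.
$str = str_repeat('a', $input_size);

$start = microtime(true);
$pos = 0;
// Consume one word character per call, resuming at byte offset $pos.
while(preg_match('/\G\w/', $str, $m, 0, $pos)) ++$pos;
$end = microtime(true);

echo '>>> NO u flag: ', number_format(($end - $start)*1000, 6), 'ms', PHP_EOL;

// Same loop with the 'u' flag: the subject is validated on every call.
$str = str_repeat('e', $input_size);

$start = microtime(true);
$pos = 0;
while(preg_match('/\G\w/u', $str, $m, 0, $pos)) ++$pos;
$end = microtime(true);

echo '>>> WITH u flag: ', number_format(($end - $start)*1000, 6), 'ms', PHP_EOL;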

Expected result:
I expect the two loops to take roughly the same amount of time, with the second one slower by only a small margin.
Both should run in `O(n)`.

Actual result:
With $input_size === 1e4:
>>> NO u flag:     9.632111ms
>>> WITH u flag: 109.670877ms

With $input_size === 1e5:
>>> NO u flag:       96.043110ms
>>> WITH u flag: 10,151.215076ms

With $input_size === 2*1e5:
>>> NO u flag:      188.354015ms
>>> WITH u flag: 40,387.295008ms

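The first case shows the expected `O(n)` behavior, while the /u case is `O(n²)`:
a 10× larger input takes about 93× longer (109.67 ms to 10,151.22 ms), and doubling the input takes about 4× longer (10,151.22 ms to 40,387.30 ms).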
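
Until such a flag is exposed, one possible workaround is to tokenize with a single `preg_match_all()` call instead of one `preg_match()` call per token. The sketch below assumes that a per-call cost (such as the validity check) is then paid once per call rather than once per token; that is an assumption about the internals, not something confirmed in this report. The helper name `tokenizeOnce()` is hypothetical.

```php
<?php
// Workaround sketch: collect all tokens with ONE preg_match_all() call.
// tokenizeOnce() is a hypothetical helper, not part of PHP.
function tokenizeOnce(string $subject): array
{
    // \G anchors each match where the previous one ended, so scanning
    // stops at the first character that is not a word character.
    // PREG_OFFSET_CAPTURE records the byte offset of every token.
    $count = preg_match_all('/\G\w/u', $subject, $matches, PREG_OFFSET_CAPTURE);
    if ($count === false) {
        // For example, invalid UTF-8 in the subject.
        throw new RuntimeException('preg_match_all failed, error code ' . preg_last_error());
    }
    return $matches[0]; // list of [token, byteOffset] pairs
}

$tokens = tokenizeOnce(str_repeat('a', 10000));
echo count($tokens), ' tokens', PHP_EOL;
```

This only helps when the token patterns can be combined into one expression; a lexer that switches patterns depending on state still needs repeated `preg_match()` calls with an offset.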


 [2016-07-27 02:48 UTC] ju1ius at laposte dot net
For reference, a quote from PCRE docs:

When the PCRE2_UTF option is set, the strings passed as patterns and subjects are (by default) checked for validity on entry to the relevant functions. If an invalid UTF string is passed, a negative error code is returned. The code unit offset to the offending character can be extracted from the match data block by calling pcre2_get_startchar(), which is used for this purpose after a UTF error.


The entire string is checked before any other processing takes place. In addition to checking the format of the string, there is a check to ensure that all code points lie in the range U+0 to U+10FFFF, excluding the surrogate area. The so-called "non-character" code points are not excluded because Unicode corrigendum #9 makes it clear that they should not be.


In some situations, you may already know that your strings are valid, and therefore want to skip these checks in order to improve performance, for example in the case of a long subject string that is being scanned repeatedly. If you set the PCRE2_NO_UTF_CHECK option at compile time or at match time, PCRE2 assumes that the pattern or subject it is given (respectively) contains only valid UTF code unit sequences.
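
The effect of this validity check is visible from PHP. The following minimal sketch shows how a subject containing an invalid UTF-8 sequence is rejected in 'u' mode; `matchUtf8()` is a hypothetical helper name used only for illustration.

```php
<?php
// Hypothetical helper: run a trivial 'u'-mode match and report both the
// return value of preg_match() and the PCRE error code that follows it.
function matchUtf8(string $subject): array
{
    $result = preg_match('/./u', $subject);
    return [$result, preg_last_error()];
}

// "\xC3\xA9" is a valid two-byte sequence (é).
[$ok, $okErr] = matchUtf8("caf\xC3\xA9");

// "\xC3\x28" is invalid: 0xC3 must be followed by a continuation byte.
[$bad, $badErr] = matchUtf8("caf\xC3\x28");

var_dump($ok, $okErr === PREG_NO_ERROR);
var_dump($bad, $badErr === PREG_BAD_UTF8_ERROR);
```

For the invalid subject, `preg_match()` returns `false` and `preg_last_error()` reports `PREG_BAD_UTF8_ERROR`, even though the invalid bytes come after a position where `.` could already have matched. That is consistent with the whole subject being checked before any matching takes place.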
 [2016-07-27 09:57 UTC]
Note that this is unrelated to PCRE2, which we do not support. PCRE also has PCRE_NO_UTF8_CHECK.
 [2016-08-20 13:33 UTC]
-Summary: No support for PCRE2_NO_UTF_CHECK flag +Summary: No support for PCRE_NO_UTF_CHECK flag
 [2019-03-18 12:26 UTC]
-Assigned To: +Assigned To: nikic
 [2019-03-18 12:26 UTC]
Opened to address this perf issue.
 [2019-03-18 15:59 UTC]
Automatic comment on behalf of
Log: Fixed bug #72685
 [2019-03-18 15:59 UTC]
-Status: Assigned +Status: Closed
Last updated: Tue Mar 26 16:01:26 2019 UTC