PHP :: Bug #72685 :: No support for PCRE_NO_UTF

Bug #72685

No support for PCRE_NO_UTF_CHECK flag

Submitted:

2016-07-27 02:13 UTC

Modified:

2019-03-18 12:26 UTC

Votes:	1
Avg. Score:	2.0 ± 0.0
Reproduced:	0 of 1 (0.0%)

From:

ju1ius at laposte dot net

Assigned:

nikic (profile)

Status:

Closed

Package:

PCRE related

PHP Version:

Irrelevant

OS:

Debian Sid

Private report:

CVE-ID:

None


[2016-07-27 02:13 UTC] ju1ius at laposte dot net

Description:
------------
When matching in UTF8 mode ('u' flag), the `preg_match` function checks the validity of the entire input string on every call.

PCRE2 has a flag to disable this behavior: `PCRE2_NO_UTF_CHECK`, but PHP does not expose it.

This means that when matching many times against the same (potentially long) input string, for example during lexical analysis, the whole string is revalidated on every call. With `n` calls each doing `O(n)` validation work, the total cost becomes `O(n²)` instead of the expected `O(n)`.





Test script:
---------------
<?php
// see online at https://3v4l.org/Duv7g
$input_size = 1e4;
$str = str_repeat('a', $input_size);

// Baseline: consume one word character per call, without the 'u' modifier.
$start = microtime(true);
$pos = 0;
while (preg_match('/\G\w/', $str, $m, 0, $pos)) {
    ++$pos;
}
$end = microtime(true);

echo '>>> NO u flag: ', number_format(($end - $start)*1000, 6), 'ms', PHP_EOL;

$str = str_repeat('e', $input_size);

// Same loop with the 'u' modifier: every call revalidates the whole subject.
$start = microtime(true);
$pos = 0;
while (preg_match('/\G\w/u', $str, $m, 0, $pos)) {
    ++$pos;
}
$end = microtime(true);

echo '>>> WITH u flag: ', number_format(($end - $start)*1000, 6), 'ms', PHP_EOL;

Expected result:
----------------
I expect the two loops to take roughly the same amount of time, with the second one being slower by a small margin.
The performance should be `O(n)` for both.

Actual result:
--------------
With $input_size === 1e4:
>>> NO u flag:     9.632111ms
>>> WITH u flag: 109.670877ms

With $input_size === 1e5:
>>> NO u flag:       96.043110ms
>>> WITH u flag: 10,151.215076ms

With $input_size === 2*1e5:
>>> NO u flag:      188.354015ms
>>> WITH u flag: 40,387.295008ms

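Going from 1e4 to 1e5 (10x the input), the time without the flag grows about 10x, while the time with the /u flag grows about 93x. Doubling the input from 1e5 to 2*1e5 roughly doubles the first and quadruples the second.
So we have the expected `O(n)` in the first case, and `O(n²)` with the /u flag.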
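
One way to sidestep the repeated validation is to avoid calling `preg_match` once per token. The following is a minimal sketch, assuming the lexer's token patterns can be combined into a single alternation. The `tokenize` helper and the token names are hypothetical. It relies on the assumption (not confirmed in this report) that `preg_match_all` validates the subject once per call rather than once per match.

```php
<?php
// Hypothetical helper: tokenizes $input with a single preg_match_all() call
// instead of one preg_match() call per token.
function tokenize(string $input): array
{
    // \G anchors each match at the end of the previous one, so scanning
    // stops at the first position that no alternative accepts.
    $pattern = '/\G(?:(?<word>\w+)|(?<space>\s+)|(?<punct>[^\w\s]))/u';

    // With the 'u' modifier, an invalid UTF-8 subject makes the call fail.
    if (preg_match_all($pattern, $input, $matches, PREG_SET_ORDER) === false) {
        throw new RuntimeException('PCRE error code: ' . preg_last_error());
    }

    $tokens = [];
    foreach ($matches as $m) {
        // Exactly one named group is non-empty for each match.
        foreach (['word', 'space', 'punct'] as $type) {
            if (isset($m[$type]) && $m[$type] !== '') {
                $tokens[] = [$type, $m[$type]];
                break;
            }
        }
    }
    return $tokens;
}
```

Calling `tokenize("ab, cd")` yields four tokens: a word, a punctuation mark, a space, and a word. This does not help lexers that must switch patterns depending on state, which is the case the requested flag would cover.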

Pull requests:

adds PREG_NO_UTF8_CHECK flag (php-src/2035)


[2016-07-27 02:48 UTC] ju1ius at laposte dot net

For reference, a quote from PCRE docs:
http://www.pcre.org/current/doc/html/pcre2unicode.html


When the PCRE2_UTF option is set, the strings passed as patterns and subjects are (by default) checked for validity on entry to the relevant functions. If an invalid UTF string is passed, a negative error code is returned. The code unit offset to the offending character can be extracted from the match data block by calling pcre2_get_startchar(), which is used for this purpose after a UTF error.

...

The entire string is checked before any other processing takes place. In addition to checking the format of the string, there is a check to ensure that all code points lie in the range U+0 to U+10FFFF, excluding the surrogate area. The so-called "non-character" code points are not excluded because Unicode corrigendum #9 makes it clear that they should not be.

...

In some situations, you may already know that your strings are valid, and therefore want to skip these checks in order to improve performance, for example in the case of a long subject string that is being scanned repeatedly. If you set the PCRE2_NO_UTF_CHECK option at compile time or at match time, PCRE2 assumes that the pattern or subject it is given (respectively) contains only valid UTF code unit sequences.
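
On the PHP side, the validity check described in the quote surfaces as `preg_match` returning `false` and `preg_last_error()` reporting `PREG_BAD_UTF8_ERROR`. The sketch below shows how a caller could validate a subject once up front. The `is_valid_utf8` helper is hypothetical, and without a flag to skip the check, later `preg_match` calls with the 'u' modifier still revalidate the string.

```php
<?php
// Hypothetical helper: true when $subject is valid UTF-8 according to PCRE.
// The empty pattern matches immediately at offset 0, so the cost is
// dominated by the validity check that the 'u' modifier triggers.
function is_valid_utf8(string $subject): bool
{
    return preg_match('//u', $subject) === 1;
}

$valid = "caf\u{E9}";   // "café" encoded as UTF-8
$invalid = "caf\xE9";   // lone 0xE9 byte, not a valid UTF-8 sequence

var_dump(is_valid_utf8($valid));                      // bool(true)
var_dump(is_valid_utf8($invalid));                    // bool(false)
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR);  // bool(true)
```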

[2016-07-27 09:57 UTC] nikic@php.net

Note that this is unrelated to PCRE2, which we do not support. PCRE also has PCRE_NO_UTF8_CHECK.

[2016-08-20 13:33 UTC] cmb@php.net

-Summary: No support for PCRE2_NO_UTF_CHECK flag
+Summary: No support for PCRE_NO_UTF_CHECK flag

[2019-03-18 12:26 UTC] nikic@php.net

-Assigned To:
+Assigned To: nikic

[2019-03-18 12:26 UTC] nikic@php.net

Opened https://github.com/php/php-src/pull/3957 to address this perf issue.

[2019-03-18 15:59 UTC] nikic@php.net

Automatic comment on behalf of nikita.ppv@gmail.com
Revision: http://git.php.net/?p=php-src.git;a=commit;h=2b9acd37f0a13572684dde80e3e56d5c1b2ec045
Log: Fixed bug #72685

[2019-03-18 15:59 UTC] nikic@php.net

-Status: Assigned
+Status: Closed


Copyright © 2001-2026 The PHP Group All rights reserved.	Last updated: Tue Jul 28 08:00:01 2026 UTC