Bug #72685 No support for PCRE_NO_UTF_CHECK flag
Submitted: 2016-07-27 02:13 UTC
Modified: 2019-03-18 12:26 UTC
Votes: 1
Avg. Score: 2.0 ± 0.0
Reproduced: 0 of 1 (0.0%)
From: ju1ius at laposte dot net
Assigned: nikic
Status: Closed
Package: PCRE related
PHP Version: Irrelevant
OS: Debian Sid
Private report: No
CVE-ID: None
 [2016-07-27 02:13 UTC] ju1ius at laposte dot net
Description:
------------
When matching in UTF8 mode ('u' flag), the `preg_match` function checks the validity of the entire input string on every call.

PCRE2 has a flag to disable this behavior: `PCRE2_NO_UTF_CHECK`, but PHP does not expose it.

This means that when matching many times against the same (potentially long) input string (for example, during lexical analysis), the repeated validation does a lot of unnecessary work, leading to catastrophic performance (`O(n²)` instead of the expected `O(n)`).
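
A rough cost model makes the quadratic estimate concrete. This is only a sketch: the constants $c$ (validation cost per byte) and $m$ (cost of matching one token) are assumptions introduced for illustration, as is the simplification that every call validates all $n$ bytes of the subject and that the lexer makes about $n$ calls.

```latex
T_{\text{with u}}(n) \approx \sum_{k=0}^{n-1} \left( c\,n + m \right) = c\,n^{2} + m\,n = O(n^{2}),
\qquad
T_{\text{no u}}(n) \approx \sum_{k=0}^{n-1} m = m\,n = O(n)
```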





Test script:
---------------
<?php
// see online at https://3v4l.org/Duv7g
$input_size = (int) 1e4; // subject length in bytes
$str = str_repeat('a', $input_size);

// Baseline: consume one word character per call, without the u flag.
$start = microtime(true);
$pos = 0;
while (preg_match('/\G\w/', $str, $m, 0, $pos)) ++$pos;
$end = microtime(true);

echo '>>> NO u flag: ', number_format(($end - $start) * 1000, 6), 'ms', PHP_EOL;

$str = str_repeat('e', $input_size);

// Same loop with the u flag: each call checks the subject for UTF-8 validity.
$start = microtime(true);
$pos = 0;
while (preg_match('/\G\w/u', $str, $m, 0, $pos)) ++$pos;
$end = microtime(true);

echo '>>> WITH u flag: ', number_format(($end - $start) * 1000, 6), 'ms', PHP_EOL;

Expected result:
----------------
I expect the two loops to take roughly the same amount of time, with the second one being slower by a very small margin.
The performance should be `O(n)` for both.

Actual result:
--------------
With $input_size === 1e4:
>>> NO u flag:     9.632111ms
>>> WITH u flag: 109.670877ms

With $input_size === 1e5:
>>> NO u flag:       96.043110ms
>>> WITH u flag: 10,151.215076ms

With $input_size === 2*1e5:
>>> NO u flag:      188.354015ms
>>> WITH u flag: 40,387.295008ms

Looks like we have the expected `O(n)` in the first case,
and `O(n²)` with the /u flag.
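
The growth rates can be checked against the timings above with a little arithmetic. The sketch below estimates the exponent k in t ~ n^k from consecutive pairs of measurements; `scaling_exponent` is a hypothetical helper written for this illustration, not part of the original test script.

```php
<?php
// Hypothetical helper: estimate k in t ~ n^k from two
// (input size, elapsed time) measurements.
function scaling_exponent(float $n1, float $t1, float $n2, float $t2): float
{
    return log($t2 / $t1) / log($n2 / $n1);
}

// Timings in milliseconds, copied from the results reported above.
$no_u   = [[1e4, 9.632111], [1e5, 96.043110], [2e5, 188.354015]];
$with_u = [[1e4, 109.670877], [1e5, 10151.215076], [2e5, 40387.295008]];

foreach (['NO u flag' => $no_u, 'WITH u flag' => $with_u] as $label => $runs) {
    for ($i = 1; $i < count($runs); ++$i) {
        [$n1, $t1] = $runs[$i - 1];
        [$n2, $t2] = $runs[$i];
        printf(
            "%s: n %.0f -> %.0f, exponent ~ %.2f\n",
            $label, $n1, $n2, scaling_exponent($n1, $t1, $n2, $t2)
        );
    }
}
```

The estimated exponents come out close to 1 without the flag and close to 2 with it, which is consistent with the linear versus quadratic reading of the numbers.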



History

 [2016-07-27 02:48 UTC] ju1ius at laposte dot net
For reference, a quote from PCRE docs:
http://www.pcre.org/current/doc/html/pcre2unicode.html


When the PCRE2_UTF option is set, the strings passed as patterns and subjects are (by default) checked for validity on entry to the relevant functions. If an invalid UTF string is passed, a negative error code is returned. The code unit offset to the offending character can be extracted from the match data block by calling pcre2_get_startchar(), which is used for this purpose after a UTF error.

...

The entire string is checked before any other processing takes place. In addition to checking the format of the string, there is a check to ensure that all code points lie in the range U+0 to U+10FFFF, excluding the surrogate area. The so-called "non-character" code points are not excluded because Unicode corrigendum #9 makes it clear that they should not be.

...

In some situations, you may already know that your strings are valid, and therefore want to skip these checks in order to improve performance, for example in the case of a long subject string that is being scanned repeatedly. If you set the PCRE2_NO_UTF_CHECK option at compile time or at match time, PCRE2 assumes that the pattern or subject it is given (respectively) contains only valid UTF code unit sequences.
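
Since PHP does not expose this option, one possible userland workaround is to collect all tokens with a single `preg_match_all()` call, so that the subject is handed to PCRE once instead of once per token. This is a sketch only: `tokenize_once` is a hypothetical helper name, and whether a single call actually avoids the repeated validation depends on the PHP version, which this report does not confirm.

```php
<?php
// Hypothetical helper: returns [matched text, byte offset] pairs for the
// consecutive matches of $pattern, starting at offset 0.
function tokenize_once(string $pattern, string $subject): array
{
    $count = preg_match_all(
        $pattern,
        $subject,
        $matches,
        PREG_SET_ORDER | PREG_OFFSET_CAPTURE
    );
    if ($count === false) {
        // For example, PREG_BAD_UTF8_ERROR when the subject is not valid UTF-8.
        throw new RuntimeException(
            'preg_match_all failed, error code ' . preg_last_error()
        );
    }

    $tokens = [];
    foreach ($matches as $match) {
        $tokens[] = $match[0]; // [matched text, byte offset]
    }
    return $tokens;
}

// \G keeps the matches contiguous, as in the loop from the test script.
$tokens = tokenize_once('/\G\w/u', str_repeat('e', 5));
echo count($tokens), PHP_EOL; // prints 5
```

This only fits lexers whose token rules can be expressed as one pattern; a lexer that switches patterns between calls would still need the per-call offset form from the test script.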
 [2016-07-27 09:57 UTC] nikic@php.net
Note that this is unrelated to PCRE2, which we do not support. PCRE also has PCRE_NO_UTF8_CHECK.
 [2016-08-20 13:33 UTC] cmb@php.net
-Summary: No support for PCRE2_NO_UTF_CHECK flag
+Summary: No support for PCRE_NO_UTF_CHECK flag
 [2019-03-18 12:26 UTC] nikic@php.net
-Assigned To:
+Assigned To: nikic
 [2019-03-18 12:26 UTC] nikic@php.net
Opened https://github.com/php/php-src/pull/3957 to address this perf issue.
 [2019-03-18 15:59 UTC] nikic@php.net
Automatic comment on behalf of nikita.ppv@gmail.com
Revision: http://git.php.net/?p=php-src.git;a=commit;h=2b9acd37f0a13572684dde80e3e56d5c1b2ec045
Log: Fixed bug #72685
 [2019-03-18 15:59 UTC] nikic@php.net
-Status: Assigned
+Status: Closed
 