PHP :: Bug #44336 :: preg_replace with /u (unicode/UTF-8) and many hits = very bad performance

Bug #44336	preg_replace with /u (unicode/UTF-8) and many hits = very bad performance
Submitted:	2008-03-05 14:10 UTC	Modified:	2009-01-13 19:23 UTC
From:	frode at coretrek dot com	Assigned:	nlopess (profile)
Status:	Closed	Package:	PCRE related
PHP Version:	5.2.6RC1	OS:	Debian GNU/Linux 4.0r3
Private report:	No	CVE-ID:	None

View Developer Edit

Welcome back! If you're the original bug submitter, here's where you can edit the bug or add additional notes.
If you forgot your password, you can retrieve your password here.

Password:

Status:
Package:
Bug Type:
Summary:
From:	frode at coretrek dot com
New email:
PHP Version:		OS:

New Comment:

[2008-03-05 14:10 UTC] frode at coretrek dot com

Description:
------------
The "/u" modifier with preg_replace() yields extraordinarily poor performance when there are a lot of matches.

I tried to run php through valgrind/callgrind and kcachegrind, and it seems that the time is mostly spent in "php__pcre_valid_utf8()", Perhaps this method is (unnecessarily) called over and over again, once for each substring match/replace?

This happens at least in both PHP 5.2.5 and PHP 5.2.6RC1

Reproduce code:
---------------
<?php
$goodstring = str_repeat('Test',  50000);
$badstring  = str_repeat('Test ', 50000);

$t = microtime(true);
preg_replace('/\\s+/', ' ', $goodstring);
echo "matches: NO  unicode: NO  ".(microtime(true)-$t)." sec\n";

$t = microtime(true);
preg_replace('/\\s+/u', ' ', $goodstring);
echo "matches: NO  unicode: YES ".(microtime(true)-$t)." sec\n";

$t = microtime(true);
preg_replace('/\\s+/', ' ', $badstring);
echo "matches: YES unicode: NO  ".(microtime(true)-$t)." sec\n";

$t = microtime(true);
preg_replace('/\\s+/u', ' ', $badstring);
echo "matches: YES unicode: YES ".(microtime(true)-$t)." sec\n";    


Expected result:
----------------
Similar performance for all runs (less than 1 second)

Actual result:
--------------
matches: NO  unicode: NO  0.020231962204 sec
matches: NO  unicode: YES 0.0206818580627 sec
matches: YES unicode: NO  0.0361981391907 sec
matches: YES unicode: YES 27.6555769444 sec

Patches

Pull Requests

History

AllCommentsChangesGit/SVN commitsRelated reports

[2008-03-05 15:44 UTC] frode at coretrek dot com

According to ext/pcre/pcrelib/doc/pcre.txt and ext/pcre/pcrelib/ChangeLog there is a flag PCRE_NO_UTF8_CHECK which was added in libpcre 4.5:

> 3. When matching a UTF-8 string, the test for a valid string at the >start has
>    been extended. If start_offset is not zero, PCRE now checks that >it points
>    to a byte that is the start of a UTF-8 character. If not, it >returns
>    PCRE_ERROR_BADUTF8_OFFSET (-11). Note: the whole string is still >checked;
>    this is necessary because there may be backward assertions in the >pattern.
>    When matching the same subject several times, it may save >resources to use
>    PCRE_NO_UTF8_CHECK on all but the first call if the string is >long.

I tried patching ext/pcre/php_pcre.c and adding this PCRE_NO_UTF8_CHECK to the options passed to pcre_exec() (setting the flag after the first call to pcre_exec()) and it speeds up execution tremendously.

With the patch I now get:

matches: NO  unicode: NO  0.024386882781982 sec
matches: NO  unicode: YES 0.021436929702759 sec
matches: YES unicode: NO  0.060844898223877 sec
matches: YES unicode: YES 0.062279939651489 sec

I'll attach the patch shortly, but someone should review it to make sure it doesn't open up any security holes or buffer overflows (for example, would it be possible to create an invalid UTF-8 string by using a replacement pattern containing an invalid UTF-8 string?)

[2008-03-05 15:46 UTC] frode at coretrek dot com

Here's the patch. If it doesn't come through cleanly, it's also available at http://apollo.coretrek.com/~frode/phpbug-44336.patch.txt

--- php_pcre.c.orig	2008-03-05 16:37:09.000000000 +0100
+++ php_pcre.c	2008-03-05 16:38:18.000000000 +0100
@@ -599,20 +599,23 @@
 
 	match = NULL;
 	matched = 0;
 	PCRE_G(error_code) = PHP_PCRE_NO_ERROR;
 	
 	do {
 		/* Execute the regular expression. */
 		count = pcre_exec(pce->re, extra, subject, subject_len, start_offset,
 						  exoptions|g_notempty, offsets, size_offsets);
 
+		/* Prevent lengthy UTF8 check on subsequent pcre_exec() calls to save time (See PHP bug 44336) */
+		exoptions |= PCRE_NO_UTF8_CHECK;
+		
 		/* Check for too many substrings condition. */	
 		if (count == 0) {
 			php_error_docref(NULL TSRMLS_CC, E_NOTICE, "Matched, but too many substrings");
 			count = size_offsets/3;
 		}
 
 		/* If something has matched */
 		if (count > 0) {
 			matched++;
 			match = subject + offsets[0];
@@ -1002,20 +1005,23 @@
 	match = NULL;
 	*result_len = 0;
 	start_offset = 0;
 	PCRE_G(error_code) = PHP_PCRE_NO_ERROR;
 	
 	while (1) {
 		/* Execute the regular expression. */
 		count = pcre_exec(pce->re, extra, subject, subject_len, start_offset,
 						  exoptions|g_notempty, offsets, size_offsets);
 		
+		/* Prevent lengthy UTF8 check on subsequent pcre_exec() calls to save time (See PHP bug 44336) */
+		exoptions |= PCRE_NO_UTF8_CHECK;
+		
 		/* Check for too many substrings condition. */
 		if (count == 0) {
 			php_error_docref(NULL TSRMLS_CC,E_NOTICE, "Matched, but too many substrings");
 			count = size_offsets/3;
 		}
 
 		piece = subject + start_offset;
 
 		if (count > 0 && (limit == -1 || limit > 0)) {
 			if (replace_count) {
@@ -1439,20 +1445,23 @@
 	last_match = subject;
 	match = NULL;
 	PCRE_G(error_code) = PHP_PCRE_NO_ERROR;
 	
 	/* Get next piece if no limit or limit not yet reached and something matched*/
 	while ((limit_val == -1 || limit_val > 1)) {
 		count = pcre_exec(pce->re, extra, subject,
 						  subject_len, start_offset,
 						  exoptions|g_notempty, offsets, size_offsets);
 
+		/* Prevent lengthy UTF8 check on subsequent pcre_exec() calls to save time (See PHP bug 44336) */
+		exoptions |= PCRE_NO_UTF8_CHECK;
+		
 		/* Check for too many substrings condition. */
 		if (count == 0) {
 			php_error_docref(NULL TSRMLS_CC,E_NOTICE, "Matched, but too many substrings");
 			count = size_offsets/3;
 		}
 				
 		/* If something matched */
 		if (count > 0) {
 			match = subject + offsets[0];

[2008-03-05 16:34 UTC] felipe@php.net

Nice work! :) 

Assigned to maintainer.

[2008-03-08 12:04 UTC] nlopess@php.net

This bug has been fixed in CVS.

Snapshots of the sources are packaged every three hours; this change
will be in the next snapshot. You can grab the snapshot at
http://snaps.php.net/.
 
Thank you for the report, and for helping us make PHP better.

this went to the PHP 5.3 and 6 branches only and won't be ported to PHP 5.2.

[2009-01-13 19:23 UTC] andrei@php.net

Fixed in PHP_5_2 now as well.

	php.net \| support \| documentation \| report a bug \| advanced search \| search howto \| statistics \| random bug \| login
go to bug id or search bugs for


Copyright © 2001-2025 The PHP Group All rights reserved.	Last updated: Wed Oct 29 10:00:01 2025 UTC