php.net |  support |  documentation |  report a bug |  advanced search |  search howto |  statistics |  random bug |  login
Bug #44336 preg_replace with /u (unicode/UTF-8) and many hits = very bad performance
Submitted: 2008-03-05 14:10 UTC Modified: 2009-01-13 19:23 UTC
From: frode at coretrek dot com Assigned: nlopess
Status: Closed Package: PCRE related
PHP Version: 5.2.6RC1 OS: Debian GNU/Linux 4.0r3
Private report: No CVE-ID:
 [2008-03-05 14:10 UTC] frode at coretrek dot com
Description:
------------
The "/u" modifier with preg_replace() yields extraordinarily poor performance when there are a lot of matches.

I tried to run php through valgrind/callgrind and kcachegrind, and it seems that the time is mostly spent in "php__pcre_valid_utf8()", Perhaps this method is (unnecessarily) called over and over again, once for each substring match/replace?

This happens at least in both PHP 5.2.5 and PHP 5.2.6RC1

Reproduce code:
---------------
<?php
$goodstring = str_repeat('Test',  50000);
$badstring  = str_repeat('Test ', 50000);

$t = microtime(true);
preg_replace('/\\s+/', ' ', $goodstring);
echo "matches: NO  unicode: NO  ".(microtime(true)-$t)." sec\n";

$t = microtime(true);
preg_replace('/\\s+/u', ' ', $goodstring);
echo "matches: NO  unicode: YES ".(microtime(true)-$t)." sec\n";

$t = microtime(true);
preg_replace('/\\s+/', ' ', $badstring);
echo "matches: YES unicode: NO  ".(microtime(true)-$t)." sec\n";

$t = microtime(true);
preg_replace('/\\s+/u', ' ', $badstring);
echo "matches: YES unicode: YES ".(microtime(true)-$t)." sec\n";    


Expected result:
----------------
Similar performance for all runs (less than 1 second)

Actual result:
--------------
matches: NO  unicode: NO  0.020231962204 sec
matches: NO  unicode: YES 0.0206818580627 sec
matches: YES unicode: NO  0.0361981391907 sec
matches: YES unicode: YES 27.6555769444 sec


Patches

Add a Patch

Pull Requests

Add a Pull Request

History

AllCommentsChangesGit/SVN commitsRelated reports
 [2008-03-05 15:44 UTC] frode at coretrek dot com
According to ext/pcre/pcrelib/doc/pcre.txt and ext/pcre/pcrelib/ChangeLog there is a flag PCRE_NO_UTF8_CHECK which was added in libpcre 4.5:

> 3. When matching a UTF-8 string, the test for a valid string at the >start has
>    been extended. If start_offset is not zero, PCRE now checks that >it points
>    to a byte that is the start of a UTF-8 character. If not, it >returns
>    PCRE_ERROR_BADUTF8_OFFSET (-11). Note: the whole string is still >checked;
>    this is necessary because there may be backward assertions in the >pattern.
>    When matching the same subject several times, it may save >resources to use
>    PCRE_NO_UTF8_CHECK on all but the first call if the string is >long.

I tried patching ext/pcre/php_pcre.c and adding this PCRE_NO_UTF8_CHECK to the options passed to pcre_exec() (setting the flag after the first call to pcre_exec()) and it speeds up execution tremendously.

With the patch I now get:

matches: NO  unicode: NO  0.024386882781982 sec
matches: NO  unicode: YES 0.021436929702759 sec
matches: YES unicode: NO  0.060844898223877 sec
matches: YES unicode: YES 0.062279939651489 sec

I'll attach the patch shortly, but someone should review it to make sure it doesn't open up any security holes or buffer overflows (for example, would it be possible to create an invalid UTF-8 string by using a replacement pattern containing an invalid UTF-8 string?)
 [2008-03-05 15:46 UTC] frode at coretrek dot com
Here's the patch. If it doesn't come through cleanly, it's also available at http://apollo.coretrek.com/~frode/phpbug-44336.patch.txt

--- php_pcre.c.orig	2008-03-05 16:37:09.000000000 +0100
+++ php_pcre.c	2008-03-05 16:38:18.000000000 +0100
@@ -599,20 +599,23 @@
 
 	match = NULL;
 	matched = 0;
 	PCRE_G(error_code) = PHP_PCRE_NO_ERROR;
 	
 	do {
 		/* Execute the regular expression. */
 		count = pcre_exec(pce->re, extra, subject, subject_len, start_offset,
 						  exoptions|g_notempty, offsets, size_offsets);
 
+		/* Prevent lengthy UTF8 check on subsequent pcre_exec() calls to save time (See PHP bug 44336) */
+		exoptions |= PCRE_NO_UTF8_CHECK;
+		
 		/* Check for too many substrings condition. */	
 		if (count == 0) {
 			php_error_docref(NULL TSRMLS_CC, E_NOTICE, "Matched, but too many substrings");
 			count = size_offsets/3;
 		}
 
 		/* If something has matched */
 		if (count > 0) {
 			matched++;
 			match = subject + offsets[0];
@@ -1002,20 +1005,23 @@
 	match = NULL;
 	*result_len = 0;
 	start_offset = 0;
 	PCRE_G(error_code) = PHP_PCRE_NO_ERROR;
 	
 	while (1) {
 		/* Execute the regular expression. */
 		count = pcre_exec(pce->re, extra, subject, subject_len, start_offset,
 						  exoptions|g_notempty, offsets, size_offsets);
 		
+		/* Prevent lengthy UTF8 check on subsequent pcre_exec() calls to save time (See PHP bug 44336) */
+		exoptions |= PCRE_NO_UTF8_CHECK;
+		
 		/* Check for too many substrings condition. */
 		if (count == 0) {
 			php_error_docref(NULL TSRMLS_CC,E_NOTICE, "Matched, but too many substrings");
 			count = size_offsets/3;
 		}
 
 		piece = subject + start_offset;
 
 		if (count > 0 && (limit == -1 || limit > 0)) {
 			if (replace_count) {
@@ -1439,20 +1445,23 @@
 	last_match = subject;
 	match = NULL;
 	PCRE_G(error_code) = PHP_PCRE_NO_ERROR;
 	
 	/* Get next piece if no limit or limit not yet reached and something matched*/
 	while ((limit_val == -1 || limit_val > 1)) {
 		count = pcre_exec(pce->re, extra, subject,
 						  subject_len, start_offset,
 						  exoptions|g_notempty, offsets, size_offsets);
 
+		/* Prevent lengthy UTF8 check on subsequent pcre_exec() calls to save time (See PHP bug 44336) */
+		exoptions |= PCRE_NO_UTF8_CHECK;
+		
 		/* Check for too many substrings condition. */
 		if (count == 0) {
 			php_error_docref(NULL TSRMLS_CC,E_NOTICE, "Matched, but too many substrings");
 			count = size_offsets/3;
 		}
 				
 		/* If something matched */
 		if (count > 0) {
 			match = subject + offsets[0];
 [2008-03-05 16:34 UTC] felipe@php.net
Nice work! :) 

Assigned to maintainer.
 [2008-03-08 12:04 UTC] nlopess@php.net
This bug has been fixed in CVS.

Snapshots of the sources are packaged every three hours; this change
will be in the next snapshot. You can grab the snapshot at
http://snaps.php.net/.
 
Thank you for the report, and for helping us make PHP better.

this went to the PHP 5.3 and 6 branches only and won't be ported to PHP 5.2.
 [2009-01-13 19:23 UTC] andrei@php.net
Fixed in PHP_5_2 now as well.
 
PHP Copyright © 2001-2014 The PHP Group
All rights reserved.
Last updated: Sun Apr 20 20:02:01 2014 UTC