Bug #74933 MBstring functions are much slower when called with encoding parameter
Submitted: 2017-07-17 01:08 UTC
Modified: 2017-07-23 10:26 UTC
From: reinir dot puradinata at gmail dot com
Assigned: nikic
Status: Closed
Package: mbstring related
PHP Version: 7.1.7
OS: Windows
Private report: No
CVE-ID: None
 [2017-07-17 01:08 UTC] reinir dot puradinata at gmail dot com
Description:
------------
Several mbstring functions are much slower when called with an encoding parameter than without one.
The functions that exhibit this behavior are mb_strlen, mb_substr, mb_strpos, and mb_strrpos.

For more information see:
https://stackoverflow.com/questions/45028018/a/45107408


Test script:
---------------
mb_internal_encoding("UTF-8");
echo "without encoding parameter:\n";

$a = microtime(true);
for($i=0; $i<100000; $i++){
    $n = mb_strlen("あえいおう");
}
$a = microtime(true)-$a;
echo "- mb_strlen: ".number_format($a*1000)." ms\n";

echo "\nwith encoding parameter:\n";

$b = microtime(true);
for($i=0; $i<100000; $i++){
    $n = mb_strlen("あえいおう", "UTF-8");
}
$b = microtime(true)-$b;
echo "- mb_strlen: ".number_format($b*1000)." ms (".number_format(($b-$a)*100/$a)."% slower)\n";
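
The script above times only mb_strlen. A minimal sketch of the same comparison extended to the four functions named in the description could look like the following. The helper time_calls and the $cases table are hypothetical names introduced for this example, and the closure call adds the same overhead to both variants, so the ratios it prints will understate the raw difference.

```php
<?php
mb_internal_encoding('UTF-8');
$s = 'あえいおう';

// Hypothetical helper: run $fn $iterations times, return elapsed milliseconds.
function time_calls(callable $fn, int $iterations = 100000): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return (microtime(true) - $start) * 1000;
}

// Each entry: [call without encoding parameter, call with encoding parameter].
$cases = [
    'mb_strlen' => [
        function () use ($s) { return mb_strlen($s); },
        function () use ($s) { return mb_strlen($s, 'UTF-8'); },
    ],
    'mb_substr' => [
        function () use ($s) { return mb_substr($s, 1, 2); },
        function () use ($s) { return mb_substr($s, 1, 2, 'UTF-8'); },
    ],
    'mb_strpos' => [
        function () use ($s) { return mb_strpos($s, 'い', 0); },
        function () use ($s) { return mb_strpos($s, 'い', 0, 'UTF-8'); },
    ],
    'mb_strrpos' => [
        function () use ($s) { return mb_strrpos($s, 'い', 0); },
        function () use ($s) { return mb_strrpos($s, 'い', 0, 'UTF-8'); },
    ],
];

foreach ($cases as $name => list($without, $with)) {
    printf(
        "%-10s without: %6.0f ms   with: %6.0f ms\n",
        $name,
        time_calls($without),
        time_calls($with)
    );
}
```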


Expected result:
----------------
Because the character encoding is UTF-8 in both cases, I expected to see similar performance.
In other words, 0% slower or close to it.


Actual result:
--------------
When called with an encoding parameter, performance drops sharply.
Example output from the test script:

without encoding parameter:
- mb_strlen: 14 ms

with encoding parameter:
- mb_strlen: 585 ms (4,186% slower)


 [2017-07-18 07:53 UTC] jhdxr@php.net
-Status: Open +Status: Not a bug
 [2017-07-18 07:53 UTC] jhdxr@php.net
Thank you for taking the time to write to us, but this is not
a bug. Please double-check the documentation available at
http://www.php.net/manual/ and the instructions on how to report
a bug at http://bugs.php.net/how-to-report.php

Whenever you pass in a string, PHP has to convert it into the internal enum type, which is why the second case is so slow in your test script.
 [2017-07-18 08:15 UTC] nikic@php.net
-Status: Not a bug +Status: Re-Opened
 [2017-07-18 08:15 UTC] nikic@php.net
There are better ways to look up an encoding than doing an O(n) search: https://github.com/php/php-src/blob/6053987bc27e8dede37f437193a5cad448f99bce/ext/mbstring/libmbfl/mbfl/mbfl_encoding.c#L224

This should be turned into a lookup table. Additionally, caching the last used encoding probably makes sense.
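
To illustrate the difference between the two lookup strategies, here is a minimal PHP sketch. It is not the libmbfl code (which is C): the three-entry encoding list and the functions find_encoding_linear, build_encoding_table and find_encoding_hashed are made up for this example.

```php
<?php
// Illustrative subset only; the real list in libmbfl is much longer.
$encodings = [
    ['name' => 'ASCII',      'aliases' => ['US-ASCII']],
    ['name' => 'ISO-8859-1', 'aliases' => ['latin1']],
    ['name' => 'UTF-8',      'aliases' => ['utf8']],
];

// O(n) search: compare the requested name against every name and alias in turn.
function find_encoding_linear(array $encodings, string $name): ?string
{
    foreach ($encodings as $enc) {
        if (strcasecmp($enc['name'], $name) === 0) {
            return $enc['name'];
        }
        foreach ($enc['aliases'] as $alias) {
            if (strcasecmp($alias, $name) === 0) {
                return $enc['name'];
            }
        }
    }
    return null;
}

// Build the lookup table once: lowercased name or alias => canonical name.
function build_encoding_table(array $encodings): array
{
    $table = [];
    foreach ($encodings as $enc) {
        $table[strtolower($enc['name'])] = $enc['name'];
        foreach ($enc['aliases'] as $alias) {
            $table[strtolower($alias)] = $enc['name'];
        }
    }
    return $table;
}

// Single hash lookup per call, independent of the number of encodings.
function find_encoding_hashed(array $table, string $name): ?string
{
    return $table[strtolower($name)] ?? null;
}

$table = build_encoding_table($encodings);
echo find_encoding_linear($encodings, 'utf8'), "\n"; // UTF-8
echo find_encoding_hashed($table, 'utf8'), "\n";     // UTF-8
```

Both functions return the same result. The difference is that the linear version gets slower as the requested encoding sits further down the list, while the table lookup does not.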
 [2017-07-23 08:44 UTC] nikic@php.net
-Assigned To: +Assigned To: nikic
 [2017-07-23 10:26 UTC] nikic@php.net
-Status: Re-Opened +Status: Closed
 [2017-07-23 10:26 UTC] nikic@php.net
This is fixed in master with the introduction of an encoding cache. The cache avoids the expensive lookup when the requested encoding is the same as the one used last, which should be the norm. It remains to be seen whether these changes will be in PHP 7.2 as well.
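
The idea of a last-used encoding cache can be sketched in PHP as follows. This is an illustration under assumptions, not the actual patch: the EncodingResolver class, its slowLookup method, and the short name list are hypothetical stand-ins for the C code in mbstring.

```php
<?php
final class EncodingResolver
{
    /** @var string[] Illustrative subset of canonical encoding names. */
    private $names = ['ASCII', 'ISO-8859-1', 'UTF-8'];
    /** @var string|null Name passed on the previous call. */
    private $lastName = null;
    /** @var string|null Result returned for the previous call. */
    private $lastResult = null;
    /** @var int Number of times the slow path was taken. */
    public $slowLookups = 0;

    public function resolve(string $name): ?string
    {
        // Fast path: same string as last time, so skip the search entirely.
        if ($name === $this->lastName) {
            return $this->lastResult;
        }
        $this->lastName = $name;
        $this->lastResult = $this->slowLookup($name);
        return $this->lastResult;
    }

    // Stand-in for the expensive name search (a linear scan here).
    private function slowLookup(string $name): ?string
    {
        $this->slowLookups++;
        foreach ($this->names as $candidate) {
            if (strcasecmp($candidate, $name) === 0) {
                return $candidate;
            }
        }
        return null;
    }
}

$resolver = new EncodingResolver();
for ($i = 0; $i < 100000; $i++) {
    $resolver->resolve('UTF-8');
}
echo $resolver->slowLookups, "\n"; // prints 1
```

A loop like the one in the test script passes the same encoding string on every iteration, so only the first call pays for the search.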
 