Bug #74933 MBstring functions are much slower when called with encoding parameter
Submitted: 2017-07-17 01:08 UTC
Modified: 2017-07-23 10:26 UTC
From: reinir dot puradinata at gmail dot com
Assigned: nikic
Status: Closed
Package: mbstring related
PHP Version: 7.1.7
OS: Windows
Private report: No
CVE-ID: None
 [2017-07-17 01:08 UTC] reinir dot puradinata at gmail dot com
Description:
------------
Several mbstring functions are much slower when called with an explicit encoding parameter than without one.
Functions that exhibit this behavior include mb_strlen, mb_substr, mb_strpos, and mb_strrpos.

For more information see:
https://stackoverflow.com/questions/45028018/a/45107408


Test script:
---------------
<?php
mb_internal_encoding("UTF-8");
echo "without encoding parameter:\n";

$a = microtime(true);
for($i=0; $i<100000; $i++){
    $n = mb_strlen("あえいおう");
}
$a = microtime(true)-$a;
echo "- mb_strlen: ".number_format($a*1000)." ms\n";

echo "\nwith encoding parameter:\n";

$b = microtime(true);
for($i=0; $i<100000; $i++){
    $n = mb_strlen("あえいおう", "UTF-8");
}
$b = microtime(true)-$b;
echo "- mb_strlen: ".number_format($b*1000)." ms (".number_format(($b-$a)*100/$a)."% slower)\n";


Expected result:
----------------
Because the character encoding is UTF-8 in both cases, I expected to see similar performance.
In other words, 0% slower or close to it.


Actual result:
--------------
When the functions are called with the encoding parameter, performance drops sharply.
Example output from the test script:

without encoding parameter:
- mb_strlen: 14 ms

with encoding parameter:
- mb_strlen: 585 ms (4,186% slower)


History

 [2017-07-18 07:53 UTC] jhdxr@php.net
-Status: Open +Status: Not a bug
 [2017-07-18 07:53 UTC] jhdxr@php.net
Thank you for taking the time to write to us, but this is not
a bug. Please double-check the documentation available at
http://www.php.net/manual/ and the instructions on how to report
a bug at http://bugs.php.net/how-to-report.php

Whenever you pass in the encoding as a string, PHP has to convert it into an internal enum type, which is why the second case is so slow in your test script.
 [2017-07-18 08:15 UTC] nikic@php.net
-Status: Not a bug +Status: Re-Opened
 [2017-07-18 08:15 UTC] nikic@php.net
There are better ways to look up an encoding than doing an O(n) search: https://github.com/php/php-src/blob/6053987bc27e8dede37f437193a5cad448f99bce/ext/mbstring/libmbfl/mbfl/mbfl_encoding.c#L224

This should be turned into a lookup table. Additionally, caching the last used encoding probably makes sense.
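
To illustrate the difference between the O(n) search and a lookup table, here is a minimal userland sketch. It is not the libmbfl C code: the table contents, the numeric IDs, and the function names (find_encoding_linear, build_encoding_index, find_encoding_indexed) are all hypothetical and only serve to show the idea.

```php
<?php
// Hypothetical encoding table; names and IDs are placeholders,
// not the real libmbfl data.
$encodings = [
    ['name' => 'ASCII',      'id' => 1],
    ['name' => 'ISO-8859-1', 'id' => 2],
    ['name' => 'UTF-8',      'id' => 3],
    ['name' => 'SJIS',       'id' => 4],
];

// O(n): compare the requested name against every entry, case-insensitively.
function find_encoding_linear(array $encodings, string $name): ?int
{
    foreach ($encodings as $entry) {
        if (strcasecmp($entry['name'], $name) === 0) {
            return $entry['id'];
        }
    }
    return null;
}

// Build the lookup table once: lowercased name => ID.
function build_encoding_index(array $encodings): array
{
    $index = [];
    foreach ($encodings as $entry) {
        $index[strtolower($entry['name'])] = $entry['id'];
    }
    return $index;
}

// Average O(1): one hash lookup on the lowercased name.
function find_encoding_indexed(array $index, string $name): ?int
{
    return $index[strtolower($name)] ?? null;
}

$index = build_encoding_index($encodings);
var_dump(find_encoding_linear($encodings, 'utf-8'));
var_dump(find_encoding_indexed($index, 'utf-8'));
```

Both calls resolve the same encoding; the indexed variant pays the cost of walking the table once, when the index is built, instead of on every call.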
 [2017-07-23 08:44 UTC] nikic@php.net
-Assigned To: +Assigned To: nikic
 [2017-07-23 10:26 UTC] nikic@php.net
-Status: Re-Opened +Status: Closed
 [2017-07-23 10:26 UTC] nikic@php.net
This is fixed in master by introducing an encoding cache: the expensive lookup is skipped when the requested encoding matches the one used last, which should be the norm. It remains to be seen whether these changes will also land in PHP 7.2.
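
For readers who want to see the shape of such a cache, here is a rough PHP sketch of the "remember the last encoding" idea. It is an illustration under assumptions, not the actual patch: the class EncodingResolver, its methods, and the table contents are hypothetical, and the code assumes PHP 7.4 or later for typed properties.

```php
<?php
// Hypothetical resolver that caches the most recently used encoding name.
final class EncodingResolver
{
    /** @var array<string,int> placeholder table: name => ID */
    private array $table;
    private ?string $lastName = null;
    private ?int $lastId = null;
    public int $slowLookups = 0; // counts how often the slow path ran

    public function __construct(array $table)
    {
        $this->table = $table;
    }

    public function resolve(string $name): ?int
    {
        // Fast path: same name as the previous successful call.
        if ($this->lastName !== null && $name === $this->lastName) {
            return $this->lastId;
        }

        $id = $this->slowLookup($name);
        if ($id !== null) {
            // Remember the result so the next identical call is cheap.
            $this->lastName = $name;
            $this->lastId = $id;
        }
        return $id;
    }

    // Stand-in for the expensive O(n) name search.
    private function slowLookup(string $name): ?int
    {
        $this->slowLookups++;
        foreach ($this->table as $candidate => $id) {
            if (strcasecmp($candidate, $name) === 0) {
                return $id;
            }
        }
        return null;
    }
}

$resolver = new EncodingResolver(['ASCII' => 1, 'UTF-8' => 3, 'SJIS' => 4]);
for ($i = 0; $i < 1000; $i++) {
    $resolver->resolve('UTF-8');
}
var_dump($resolver->slowLookups);
```

In a loop like the one in the test script, where every call passes the same encoding name, only the first call takes the slow path. Alternating between two encodings would defeat a single-entry cache, which is where a lookup table would still help.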
 