PHP :: Request #79545 :: mbstring functions are about 10x slower

Submitted: 2020-04-30 11:00 UTC
Modified:
Votes: 2
Avg. Score: 4.5 ± 0.5
Reproduced: 1 of 1 (100.0%)
Same Version: 0 (0.0%)
Same OS: 1 (100.0%)
From: michael dot vorisek at email dot cz
Assigned:
Status: Open
Package: Performance problem
PHP Version: 7.4.5
OS: Linux
Private report:
CVE-ID: None

[2020-04-30 11:00 UTC] michael dot vorisek at email dot cz

Description:
------------
https://3v4l.org/bOdXP

Currently, the performance difference between single-byte and multi-byte string functions is about 5x to 25x (yes, +400% to +2400%).

Please verify and comment.

If mbstring itself cannot be made faster, then checking whether the input string contains only 0x00 - 0x7f bytes would improve overall real-world PHP performance. The check can be done very quickly using modern vectorized CPU instructions and, if it passes, the equivalent single-byte string function can be used instead of the multi-byte one.
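The fast path proposed above can be approximated in userland. The following is only a minimal sketch: `ascii_fast_strtolower` is a hypothetical helper name, not part of PHP or mbstring, and it assumes UTF-8 input. It also assumes either PHP 8.2+ or the default "C" locale, because `strtolower` was locale-sensitive before PHP 8.2.

```php
<?php
// Hypothetical helper (not part of PHP or mbstring): lower-cases a UTF-8
// string, taking the single-byte path when the input is pure ASCII.
function ascii_fast_strtolower(string $str): string
{
    // Without the /u modifier PCRE works on bytes, so this matches any
    // byte in the range 0x80-0xFF. A result of 0 means "pure ASCII".
    if (preg_match('/[^\x00-\x7F]/', $str) === 0) {
        return strtolower($str);
    }

    // Non-ASCII input still needs the multi-byte implementation.
    return mb_strtolower($str, 'UTF-8');
}

echo ascii_fast_strtolower('zluty Kun'), "\n";
echo ascii_fast_strtolower('Žlutý kŮň'), "\n";
```

Whether the extra scan pays off depends on how much of the real input is pure ASCII, so it is worth measuring on representative data before adopting it.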

Test script:
---------------
<?php
// Number of iterations per measurement.
$cnt = 100000;

$strs = [
    'empty' => '',
    'short' => 'zluty kun',
    'short_with_uc' => 'zluty Kun',
    'long' => str_repeat('this is about 10000 chars long string', 270),
    'long_with_uc' => str_repeat('this is about 10000 chars long String', 270),
    'short_utf8' => 'žlutý kůň',
    'short_utf8_with_uc' => 'Žlutý kŮň',
];

foreach ($strs as $k => $str) {
    // Time the single-byte function.
    $a1 = microtime(true);
    for ($i = 0; $i < $cnt; ++$i) {
        $res = strtolower($str);
    }
    $t1 = microtime(true) - $a1;
    // echo 'it took ' . round($t1 * 1000, 3) . ' ms for strtolower' . "\n";

    // Time the multi-byte function.
    $a2 = microtime(true);
    for ($i = 0; $i < $cnt; ++$i) {
        $res = mb_strtolower($str);
    }
    $t2 = microtime(true) - $a2;
    // echo 'it took ' . round($t2 * 1000, 3) . ' ms for mb_strtolower' . "\n";

    // A ratio above 1 means mb_strtolower took longer than strtolower.
    echo 'strtolower is ' . round($t2 / $t1, 2) . 'x faster than mb_strtolower for ' . $k . "\n\n";
}


Expected result:
----------------
Difference should be no larger than 1.5x, at least for single-byte strings.

Actual result:
--------------
strtolower is 6.73x faster than mb_strtolower for empty

strtolower is 9.78x faster than mb_strtolower for short

strtolower is 7.13x faster than mb_strtolower for short_with_uc

strtolower is 24.88x faster than mb_strtolower for long

strtolower is 23.12x faster than mb_strtolower for long_with_uc

strtolower is 9.05x faster than mb_strtolower for short_utf8

strtolower is 9.93x faster than mb_strtolower for short_utf8_with_uc

[2020-06-09 20:40 UTC] alexinbeijing at gmail dot com

This is a tricky one.

It looks like there is nothing really crazy going on in mbstring which is causing this vast disparity in performance.

mbstring bounces through a series of function calls to process *each byte* of the input string. None of those function calls is doing a lot of work, but when you have 200,000,000 bytes to work on, the overhead really adds up.

Getting significantly more performance out of it might require a major redesign.
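The cost of paying a function call per byte can be illustrated with a rough userland analogy. This is not mbstring's actual code (which is written in C and structured differently); `lower_byte`, `lower_per_byte` and `lower_bulk` are hypothetical names, and the sketch only handles ASCII letters.

```php
<?php
// Illustrative analogy only: compares one call per byte with one call
// per buffer. Handles ASCII letters A-Z only.

// Per-byte "filter": invoked once for every input byte.
function lower_byte(string $byte): string
{
    $o = ord($byte);
    return ($o >= 0x41 && $o <= 0x5A) ? chr($o + 0x20) : $byte;
}

function lower_per_byte(string $str): string
{
    $out = '';
    for ($i = 0, $n = strlen($str); $i < $n; ++$i) {
        $out .= lower_byte($str[$i]); // call overhead paid once per byte
    }
    return $out;
}

// Bulk variant: a single call translates the whole buffer.
function lower_bulk(string $str): string
{
    return strtr($str, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz');
}

$input = str_repeat('this is about 10000 chars long String', 270);

$start = hrtime(true);
$a = lower_per_byte($input);
$perByteNs = hrtime(true) - $start;

$start = hrtime(true);
$b = lower_bulk($input);
$bulkNs = hrtime(true) - $start;

printf(
    "per-byte: %d ns, bulk: %d ns, same result: %s\n",
    $perByteNs,
    $bulkNs,
    var_export($a === $b, true)
);
```

The absolute numbers will vary by machine and PHP version; the point is only that both variants produce the same output while doing very different amounts of call dispatch.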

[2020-06-23 15:49 UTC] alexinbeijing at gmail dot com

I have studied this problem a bit more and am looking at ways to speed up case conversion of multi-byte strings.

Please note there is another reason why the plain `strtolower` is so much faster: It uses SSE2 instructions to process blocks of 16 bytes at once. This accounts for a large part of the difference in performance.

[2020-06-24 20:43 UTC] michael dot vorisek at email dot cz

> Please note there is another reason why the plain `strtolower` is so much faster: It uses SSE2 instructions to process blocks of 16 bytes at once. This accounts for a large part of the difference in performance.

I replaced "strtolower" with "strtoupper" in the test code above, and the difference between strtolower and strtoupper is not decisive.

for "strtolower":
strtolower is 6.71x faster than mb_strtolower for empty
strtolower is 10.61x faster than mb_strtolower for short
strtolower is 7.22x faster than mb_strtolower for short_with_uc
strtolower is 25.73x faster than mb_strtolower for long
strtolower is 23.38x faster than mb_strtolower for long_with_uc
strtolower is 10.86x faster than mb_strtolower for short_utf8
strtolower is 11.08x faster than mb_strtolower for short_utf8_with_uc

for "strtoupper":
strtoupper is 8.19x faster than mb_strtoupper for empty
strtoupper is 7.72x faster than mb_strtoupper for short
strtoupper is 7.66x faster than mb_strtoupper for short_with_uc
strtoupper is 25.3x faster than mb_strtoupper for long
strtoupper is 26.21x faster than mb_strtoupper for long_with_uc
strtoupper is 8.01x faster than mb_strtoupper for short_utf8
strtoupper is 8.05x faster than mb_strtoupper for short_utf8_with_uc

The issue is that the current mbstring functions are several times slower even with pure ASCII (7-bit) input.
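To repeat the comparison for several function pairs without editing the script each time, the benchmark can be parameterized. This is a minimal sketch, assuming PHP 7.3+ for `hrtime`; `time_calls` is a hypothetical helper, and only two of the original inputs are included for brevity.

```php
<?php
// Hypothetical helper: calls $fn($input) $iterations times and returns
// the elapsed time in nanoseconds.
function time_calls(callable $fn, string $input, int $iterations)
{
    $start = hrtime(true);
    for ($i = 0; $i < $iterations; ++$i) {
        $fn($input);
    }
    return hrtime(true) - $start;
}

// Single-byte function => multi-byte counterpart.
$pairs = [
    'strtolower' => 'mb_strtolower',
    'strtoupper' => 'mb_strtoupper',
];

$inputs = [
    'short' => 'zluty kun',
    'short_utf8' => 'žlutý kůň',
];

foreach ($pairs as $single => $multi) {
    foreach ($inputs as $label => $input) {
        $t1 = time_calls($single, $input, 100000);
        $t2 = time_calls($multi, $input, 100000);
        // max() guards against a zero denominator on very coarse timers.
        printf(
            "%s is %.2fx faster than %s for %s\n",
            $single,
            $t2 / max($t1, 1),
            $multi,
            $label
        );
    }
}
```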

[2021-06-17 17:42 UTC] dharman@php.net

On my Windows machine with PHP 8.1, the timings look similar to this:

strtolower is 2.51x faster than mb_strtolower for empty

strtolower is 12.28x faster than mb_strtolower for short

strtolower is 8.57x faster than mb_strtolower for short_with_uc

strtolower is 873.44x faster than mb_strtolower for long

strtolower is 858.52x faster than mb_strtolower for long_with_uc

strtolower is 14.34x faster than mb_strtolower for short_utf8

strtolower is 14.25x faster than mb_strtolower for short_utf8_with_uc

The optimizations in strtolower are huge and make a lot of difference. It would be great if mbstring could be optimized as well.
