php.net |  support |  documentation |  report a bug |  advanced search |  search howto |  statistics |  random bug |  login
Bug #81378 mb_detect_encoding() performance regression in PHP 8.1
Submitted: 2021-08-23 20:10 UTC Modified: 2021-09-20 14:33 UTC
From: gfpuba+php at gmail dot com Assigned: alexdowad (profile)
Status: Closed Package: Scripting Engine problem
PHP Version: 8.1.0beta3 OS: Win10 Apache/2.4.48
Private report: No CVE-ID: None
View Add Comment Developer Edit
Welcome! If you don't have a Git account, you can't do anything here.
You can add a comment by following this link or if you reported this bug, you can edit this bug over here.
(description)
Block user comment
Status: Assign to:
Package:
Bug Type:
Summary:
From: gfpuba+php at gmail dot com
New email:
PHP Version: OS:

 

 [2021-08-23 20:10 UTC] gfpuba+php at gmail dot com
Description:
------------
The following code takes 10 times longer to execute on PHP/8.1.0beta3 compare to PHP 8.0.8


Test script:
---------------
$a = str_repeat('abcdef', 10000);
$b = mb_detect_encoding($a, 'UTF-8', true);



Patches

Add a Patch

Pull Requests

Add a Pull Request

History

AllCommentsChangesGit/SVN commitsRelated reports
 [2021-08-24 13:35 UTC] nikic@php.net
-Summary: Performance problem +Summary: mb_detect_encoding() performance regression in PHP 8.1
 [2021-08-24 13:35 UTC] nikic@php.net
Encoding detection now scores encodings based on character properties, and the property lookups are very expensive and make up the majority of the execution time now.

We don't actually need them for the case of a single encoding, but I assume that was just a minimal reduction, because calling mb_detect_encoding() with a single encoding doesn't make sense (mb_check_encoding() should be used instead).
 [2021-08-24 19:30 UTC] nikic@php.net
-Assigned To: +Assigned To: alexdowad
 [2021-08-24 19:30 UTC] nikic@php.net
I've applied a couple of obvious optimizations in:
 * https://github.com/php/php-src/commit/3be94217f4c353e718db6d823dbd74a29522fce3
 * https://github.com/php/php-src/commit/f458b16041b6c9101d2f846027aba4a4b08e5a50
 * https://github.com/php/php-src/commit/425c2e3ba1e3096f26adfef70f8afac9ae835ba9

There's more that can be done here, though I'm not sure how final the "algorithm" here is.

Maybe Alex wants to take a look as well...
 [2021-08-24 19:55 UTC] alexdowad@php.net
Dear nikic, thanks for sending this report my way. I'm not sure if you received the e-mail I sent you recently about mbstring, since there was no reply. Anyways, to reiterate what I said in the e-mail, I am just trying to get test coverage for mbstring (as reported by gcov) close to 100%, before doing any more cleanup or performance work.

After test coverage is better, I will be happy to work on this. Your point about mb_detect_encoding() with a single candidate encoding is good, and suggests we could internally detect that case and delegate to mb_check_encoding(). Do you think that would be worthwhile?
 [2021-08-25 08:41 UTC] nikic@php.net
> Dear nikic, thanks for sending this report my way. I'm not sure if you received the e-mail I sent you recently about mbstring, since there was no reply. Anyways, to reiterate what I said in the e-mail, I am just trying to get test coverage for mbstring (as reported by gcov) close to 100%, before doing any more cleanup or performance work.

Sent a reply!

> After test coverage is better, I will be happy to work on this. Your point about mb_detect_encoding() with a single candidate encoding is good, and suggests we could internally detect that case and delegate to mb_check_encoding(). Do you think that would be worthwhile?

Sounds reasonable to me. The strict=true case at least is mb_check_encoding() while strict=false ... I think that should basically be an unconditional "return true", but I think currently that checks whether the first character of the string is valid in the encoding.
 [2021-09-07 06:58 UTC] alexdowad@php.net
Checking for fixed ranges of codepoints rather than doing Unicode property lookups via hash table makes mb_detect_encoding a bit less than 4x faster. Redirecting to mb_check_encoding under the hood gives another 5% gain. (Wonder if that's even worth doing? It might be if the mb_detect_encoding algorithm becomes more complex in the future to achieve more accurate detection...)
 [2021-09-20 14:33 UTC] nikic@php.net
-Status: Assigned +Status: Closed
 [2021-09-20 14:33 UTC] nikic@php.net
This should be fixed now that property lookups are no longer used.
 
PHP Copyright © 2001-2024 The PHP Group
All rights reserved.
Last updated: Fri Mar 29 07:01:28 2024 UTC