Bug #81390 mb_detect_encoding() regression
Submitted: 2021-08-26 17:03 UTC Modified: 2021-11-25 08:42 UTC
From: alec at alec dot pl Assigned: alexdowad (profile)
Status: Assigned Package: mbstring related
PHP Version: 8.1.0beta5 OS:
Private report: No CVE-ID: None

 [2021-08-26 17:03 UTC] alec at alec dot pl
Description:
------------
The same code returns "ISO-8859-1" on PHP 8.0 and "UUENCODE" on PHP 8.1.0beta3.

Note: the text contains some ASCII with two 0xEB characters
Note: ISO-8859-1 comes before UUENCODE in the mb_list_encodings() result.
Note: Even removing UUENCODE from the list does not make it return the expected ISO-8859-1

Test script:
---------------
$test = base64_decode('Q0hBUlNFVD13aW5kb3dzLTEyNTI6RG/rO0pvaG4=');

echo mb_detect_encoding($test, mb_list_encodings());

Expected result:
----------------
ISO-8859-1

Actual result:
--------------
UUENCODE

History

 [2021-08-26 17:10 UTC] alec at alec dot pl
Using $test = 'test:test' in the test script will also return UUENCODE. This is really wrong.
 [2021-08-26 17:20 UTC] alec at alec dot pl
$test = 'test:test';

$encodings = array_diff(mb_list_encodings(), ['UUENCODE', 'wchar']);
echo mb_detect_encoding($test, $encodings, true);

returns HTML-ENTITIES, this makes no sense.
 [2021-08-27 09:12 UTC] cmb@php.net
-Status: Open +Status: Verified
 [2021-08-27 09:12 UTC] cmb@php.net
Even mb_check_encoding() fails in the same way:
<https://3v4l.org/klWd0/rfc>.
 [2021-08-27 13:09 UTC] nikic@php.net
-Assigned To: +Assigned To: alexdowad
 [2021-08-27 13:09 UTC] nikic@php.net
There are multiple issues here:

1. We should consider illegal trailing characters in non-strict mode if there are still multiple eligible encodings. Fixed in https://github.com/php/php-src/commit/43cb2548f7fd09ac3471bd71c5d28fbeaa312f2d. This prevents detection of UCS-2 for "test:test", which has an incomplete last character.

2. Some of the "special" encodings currently don't do strict validation. uuencode is one of those and thus ends up accepting everything. The filter should probably be fixed, though I believe we want to drop support for these "encodings" anyway.

3. However, the real issue here is user error. You're throwing a big bucket of ambiguous encodings at mbstring, and asking it to pick something. This kinda worked out before because in PHP 8.0 only a limited set of encodings supported encoding detection. In PHP 8.1 all encodings support detection, including multi-byte encodings. The string "test:tes" looks nice as UTF-8, but is also "瑥獴㩴敳" in UCS-2. Unless we want to bias detection towards ASCII rather than CJK, both of these are sensible choices. UCS-2 is earlier in the encoding list, and doesn't have punctuation besides.

If you want to limit detection to only UTF-8 and ISO-8859 style encodings, then you should specify that in your encoding list. If you don't want to get back UCS-2 for something that is valid UCS-2, don't specify UCS-2.
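To illustrate that advice, a minimal sketch (the candidate list here is just an example, not a recommendation for any particular application):

```php
<?php
// Limit detection to the encodings actually expected in the data,
// rather than passing the whole of mb_list_encodings().
$test = 'test:test';
$candidates = ['UTF-8', 'ISO-8859-1', 'WINDOWS-1252'];

// The third argument enables strict validation of each candidate.
echo mb_detect_encoding($test, $candidates, true); // UTF-8
```

With UCS-2 and the "special" pseudo-encodings absent from the list, a plain ASCII string can no longer be detected as either.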

I think the only thing we could do here is to exclude various special encodings from detection (like https://gist.github.com/nikic/7cab20f7286c2b9437276c4fa43f6fb4), but apart from that I think that things are working correctly here.

Maybe Alex has some more thoughts on this.
 [2021-08-27 15:32 UTC] nikic@php.net
> Unless we want to bias detection towards ASCII rather than CJK

After thinking about this a bit more, I think we might want to do something like this. I just tried detecting UTF-16LE and UTF-16BE, and this basically ends up picking whichever is first even if the string is very "obviously" one or the other based on null bytes. The reverse order can still be meaningfully interpreted as CJK code points.
 [2021-08-27 15:42 UTC] nikic@php.net
It might also make sense to just artificially limit the supported encodings to something closer to PHP-8.0. This functionality has close to zero test coverage right now...
 [2021-08-27 17:02 UTC] alec at alec dot pl
I guess I'll use a sane list of encodings, but there's still something wrong.

$test = 'test:test';
$encodings = ['UTF-8', 'SJIS', 'GB2312',
         'ISO-8859-1', 'ISO-8859-2', 'ISO-8859-3', 'ISO-8859-4',
         'ISO-8859-5', 'ISO-8859-6', 'ISO-8859-7', 'ISO-8859-8', 'ISO-8859-9',
         'ISO-8859-10', 'ISO-8859-13', 'ISO-8859-14', 'ISO-8859-15', 'ISO-8859-16',
         'WINDOWS-1252', 'WINDOWS-1251', 'EUC-JP', 'EUC-TW', 'KOI8-R', 'BIG-5',
         'ISO-2022-KR', 'ISO-2022-JP', 'UTF-16'
];
echo mb_detect_encoding($test, $encodings);

returns "UTF-16". Maybe that's one of the issues you described already.
 [2021-08-27 17:46 UTC] alexdowad@php.net
Thanks to Alec for the report! Some comments:

The new legacy encoding detection code, which (as Nikita mentioned) is intended to work with all supported encodings, uses a couple of simple heuristics:
  - If the input string is not valid in a candidate encoding, that encoding is immediately rejected.
  - When the input string is converted to a candidate encoding, each control character or codepoint in Unicode's Private Use Area counts for 10 "demerits" against the candidate.
  - Each punctuation character counts for 1 "demerit", since punctuation is much less common in natural-language strings than letters (also, when Shift-JIS or ISO-2022 strings are misinterpreted as ASCII, they tend to contain large numbers of punctuation characters).
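As a rough illustration of those heuristics, here is a toy PHP model; the function name, the punctuation set, and the constants are made up for the sketch (the real tables and logic live in mbstring's C code):

```php
<?php
// Toy model of the demerit scoring described above: 10 demerits for a
// control character or a BMP Private Use Area codepoint, 1 for punctuation.
function score_candidate(array $codepoints): int {
    $punct = array_map('ord', str_split('!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'));
    $demerits = 0;
    foreach ($codepoints as $cp) {
        if ($cp < 0x20 || ($cp >= 0xE000 && $cp <= 0xF8FF)) {
            $demerits += 10; // control character or Private Use Area
        } elseif (in_array($cp, $punct, true)) {
            $demerits += 1;  // punctuation weighs lightly against a candidate
        }
    }
    return $demerits;
}

// 'test:test' decoded as ASCII: one colon => 1 demerit.
echo score_candidate(array_map('ord', str_split('test:test'))); // 1
```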

We can easily add more heuristics, and refine the existing ones. We will _never_ get anything close to 100% accuracy; frankly, even with human intelligence, it is not always possible to figure out what the intended encoding of some random string is.

We should favor heuristics which will improve detection accuracy in a wide range of situations, and which are fast to evaluate.

Here are a couple of ideas:

- Consider completely banning uuencode, base64, QPrint, 'HTML entities', '7-bit', and '8-bit' from being returned as the detected text encoding.
  - (Line 46 of mbfl_encoding.h is interesting; it shows that the original author of MBString recognized that these are not really 'text encodings' in the same sense that the other supported encodings are.)

- Rank the other supported encodings according to the likelihood that they will be encountered 'in the wild', and favor those higher on the list when more than one candidate is possible.

Here's another thought. Two common scenarios when encoding detection returns the wrong result:

1) The string decodes to all or almost all CJK characters, which are usually _very rare_ CJK characters.
2) (This is what happens when a CJK string is mistakenly detected as ASCII or a 'European' encoding.) The string decodes to a mishmash of letters and punctuation, with very few spaces.

This implies that it might be helpful to classify CJK characters as 'common' and 'rare', perhaps using something like a Bloom filter. For strings with a high proportion of 'european' characters, maybe we should expect a good number of spaces (unless the string is just a single word).

I think that finding PUA codepoints is a fairly good indicator that a candidate encoding might be wrong, but rather than the current test, we could simply check for a range of values (i.e. c >= PUA_MIN && c <= PUA_MAX). This might help to speed things up, since we are also seeing reports that the new encoding detection code is too slow for some users.
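Sketched in PHP for clarity (the actual check would of course be in mbstring's C code; this only covers the BMP Private Use Area, not planes 15-16):

```php
<?php
// The BMP Private Use Area is U+E000..U+F8FF; a single range comparison
// is cheaper than a per-codepoint table lookup.
const PUA_MIN = 0xE000;
const PUA_MAX = 0xF8FF;

function in_private_use_area(int $cp): bool {
    return $cp >= PUA_MIN && $cp <= PUA_MAX;
}

var_dump(in_private_use_area(0xE001)); // bool(true)
var_dump(in_private_use_area(0x0144)); // bool(false): ń is an ordinary letter
```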
 [2021-09-06 19:43 UTC] alexdowad@php.net
OK, just found and fixed a bug! After the fix, Alec's test case returns ISO-8859-1.

Thanks for helping us to find it, Alec!
 [2021-09-06 19:55 UTC] alexdowad@php.net
-Status: Verified +Status: Closed
 [2021-09-06 19:55 UTC] alexdowad@php.net
Will submit the fix in my next PR for mbstring.
 [2021-09-24 06:12 UTC] alec at alec dot pl
The original case is still not fixed in 8.1.0rc2.
 [2021-09-25 12:36 UTC] alec at alec dot pl
-Status: Closed +Status: Assigned
 [2021-09-25 12:36 UTC] alec at alec dot pl
Can I have a clarification on which version this is fixed?
 [2021-09-25 12:44 UTC] alexdowad@php.net
Hi, Alec. The fix is in commit c25a1ef8d0, which was merged a few days ago (around September 20). It should be included in the next release in the 8.1 series. I'm not involved in cutting new releases and don't know what the schedule for releases is.

Thanks again for reporting the bug.
 [2021-09-25 13:00 UTC] alec at alec dot pl
-Status: Assigned +Status: Closed
 [2021-09-25 13:00 UTC] alec at alec dot pl
So, it will be rc3. Thank you.
 [2021-10-15 07:47 UTC] alec at alec dot pl
I don't know what's going on, but I now have rc4 and this is still not fixed.

https://3v4l.org/lfM8J
https://3v4l.org/0enJ8

3v4l does not have rc4 yet, but it was supposed to be fixed in rc3. Looks like it isn't.
 [2021-10-15 09:46 UTC] alec at alec dot pl
Another case:

https://3v4l.org/ritGP
 [2021-10-15 11:17 UTC] alec at alec dot pl
Another one:

https://3v4l.org/QK7q6
 [2021-10-17 17:20 UTC] alexdowad@php.net
Hi Alec,

The test case which you provided in the original bug report does appear to return good results now. Looks like the bug fix had a good effect:

https://3v4l.org/UXtBd

Thank you for providing more test cases in which the latest implementation of mb_detect_encoding does not return accurate results. Those will be added to the test suite for the PHP interpreter.

I have been working on another patch for mbstring which significantly enhances the accuracy of legacy encoding detection. Several of the test cases you have added in this thread already return the correct result with that new patch.

Once the next patch is merged, we would appreciate your further assistance with testing. If you are able to build PHP from source, it would be even more helpful if you could test even before the next rc. However, I understand that building a C program may seem daunting to someone who has never done it before, and it is work you didn't sign up for.
 [2021-10-17 18:48 UTC] alexdowad@php.net
-Status: Closed +Status: Assigned
 [2021-11-01 19:35 UTC] alexdowad@php.net
-Status: Assigned +Status: Closed
 [2021-11-01 19:35 UTC] alexdowad@php.net
Closing this for now, as I believe the enhancements which have been merged into mainline will fully address Alec's concerns. (Whenever a new public release comes out which he can install and try, that is.)

Alec, please re-open this if you still discover any problems with the detection accuracy of mb_detect_encoding() in the next rc release. On the other hand, if it works well for you, please drop a line as well.
 [2021-11-01 19:40 UTC] patrickallaert@php.net
This is part of PHP 8.1.0RC5.
 [2021-11-04 17:35 UTC] alec at alec dot pl
-Status: Closed +Status: Assigned -PHP Version: 8.1.0beta3 +PHP Version: 8.1.0beta5
 [2021-11-04 17:35 UTC] alec at alec dot pl
After testing rc5 I have one last case.

https://3v4l.org/9o7ft

There is a reasonably short input ('Iksi' . chr(241) . 'ski') recognized as SJIS-2004, while I'd expect ISO-8859-2 or ISO-8859-1. It is also interesting that without the second argument the result would be UTF-8.

I understand that such a short input may be confusing for the detector, but maybe there's still some bug.
 [2021-11-13 21:20 UTC] alexdowad@php.net
Alec, you are nothing short of a fabulous tester. Well done.

I am going to investigate your latest finding... just not right now. At the moment, I am cooking something up which will hopefully make almost every mbstring function several times faster. Soon to be served up, hot and fresh, in an upcoming PHP release. Bon appetit!
 [2021-11-19 08:07 UTC] alec at alec dot pl
Nice to hear that, but we're running out of time. Any chance this bug can be fixed before the final 8.1.0 release?
 [2021-11-22 18:24 UTC] alexdowad@php.net
Patrick Allaert reached out to me today to see if the latest report from Alec could be looked into before he cuts the final release for 8.1.0. Thanks very much for the 'heads up', Patrick! Gladly!

I added the following line to `mbfl_encoding_detector_judge` in mbfilter.c and recompiled:

    printf("Score for %s: %d illegal chars, %d demerits\n", filter->from->name, data->num_illegalchars, data->score);

And then ran Alec's new test case. Output:

    Score for UTF-8: 1 illegal chars, 5 demerits                                                      
    Score for ISO-8859-1: 0 illegal chars, 8 demerits                                                 
    Score for ISO-8859-2: 0 illegal chars, 37 demerits                                                
    Score for ISO-8859-3: 0 illegal chars, 8 demerits                                                 
    Score for ISO-8859-4: 0 illegal chars, 37 demerits                                                
    Score for ISO-8859-5: 0 illegal chars, 8 demerits 
    Score for ISO-8859-6: 0 illegal chars, 8 demerits 
    Score for ISO-8859-7: 0 illegal chars, 8 demerits 
    Score for ISO-8859-8: 0 illegal chars, 8 demerits 
    Score for ISO-8859-9: 0 illegal chars, 8 demerits 
    Score for ISO-8859-10: 0 illegal chars, 37 demerits
    Score for ISO-8859-13: 0 illegal chars, 37 demerits
    Score for ISO-8859-14: 0 illegal chars, 8 demerits
    Score for ISO-8859-15: 0 illegal chars, 8 demerits
    Score for ISO-8859-16: 0 illegal chars, 37 demerits
    Score for Windows-1252: 0 illegal chars, 8 demerits    
    Score for Windows-1251: 0 illegal chars, 8 demerits           
    Score for Windows-1254: 0 illegal chars, 8 demerits
    Score for EUC-JP: 1 illegal chars, 4 demerits 
    Score for EUC-TW: 1 illegal chars, 4 demerits 
    Score for KOI8-R: 0 illegal chars, 8 demerits
    Score for BIG-5: 0 illegal chars, 36 demerits                                                     
    Score for ISO-2022-KR: 1 illegal chars, 4 demerits
    Score for ISO-2022-JP: 1 illegal chars, 4 demerits                                             
    Score for GB18030: 0 illegal chars, 36 demerits
    Score for UTF-32: 1 illegal chars, 0 demerits
    Score for UTF-32BE: 1 illegal chars, 0 demerits                                                   
    Score for UTF-32LE: 1 illegal chars, 0 demerits                                                   
    Score for UTF-16: 0 illegal chars, 91 demerits  
    Score for UTF-16BE: 0 illegal chars, 91 demerits
    Score for UTF-16LE: 0 illegal chars, 91 demerits                                                  
    Score for UTF-7: 1 illegal chars, 4 demerits                                                      
    Score for UTF7-IMAP: 1 illegal chars, 4 demerits
    Score for ASCII: 1 illegal chars, 4 demerits
    Score for SJIS: 1 illegal chars, 4 demerits                                                       
    Score for eucJP-win: 1 illegal chars, 4 demerits                                                  
    Score for EUC-JP-2004: 1 illegal chars, 4 demerits
    Score for SJIS-Mobile#DOCOMO: 0 illegal chars, 36 demerits
    Score for SJIS-Mobile#KDDI: 0 illegal chars, 36 demerits                                         
    Score for SJIS-Mobile#SOFTBANK: 0 illegal chars, 36 demerits
    Score for SJIS-mac: 1 illegal chars, 4 demerits
    Score for SJIS-2004: 0 illegal chars, 7 demerits
    Score for UTF-8-Mobile#DOCOMO: 1 illegal chars, 5 demerits
    Score for UTF-8-Mobile#KDDI-A: 1 illegal chars, 5 demerits
    Score for UTF-8-Mobile#KDDI-B: 1 illegal chars, 5 demerits
    Score for UTF-8-Mobile#SOFTBANK: 1 illegal chars, 5 demerits
    Score for CP932: 0 illegal chars, 36 demerits
    Score for CP51932: 1 illegal chars, 4 demerits
    Score for JIS: 1 illegal chars, 4 demerits
    Score for ISO-2022-JP-MS: 1 illegal chars, 4 demerits
    Score for Windows-1252: 0 illegal chars, 8 demerits
    Score for Windows-1254: 0 illegal chars, 8 demerits
    Score for EUC-CN: 1 illegal chars, 4 demerits
    Score for CP936: 0 illegal chars, 36 demerits
    Score for HZ: 1 illegal chars, 4 demerits
    Score for CP950: 0 illegal chars, 36 demerits
    Score for EUC-KR: 1 illegal chars, 4 demerits
    Score for UHC: 1 illegal chars, 4 demerits
    Score for Windows-1251: 0 illegal chars, 8 demerits
    Score for CP866: 0 illegal chars, 8 demerits
    Score for KOI8-U: 0 illegal chars, 8 demerits
    Score for ArmSCII-8: 0 illegal chars, 37 demerits 
    Score for CP850: 0 illegal chars, 8 demerits
    Score for ISO-2022-JP-2004: 1 illegal chars, 4 demerits
    Score for ISO-2022-JP-MOBILE#KDDI: 1 illegal chars, 4 demerits
    Score for CP50220: 1 illegal chars, 4 demerits
    Score for CP50221: 1 illegal chars, 4 demerits
    Score for CP50222: 1 illegal chars, 4 demerits
    SJIS-2004

Key lines are:

    Score for ISO-8859-1: 0 illegal chars, 8 demerits                                                 
    Score for SJIS-2004: 0 illegal chars, 7 demerits

So what we have here is a case where the heuristics employed by mb_detect_encoding are not strong enough to detect a significant difference in likelihood between ISO-8859-1 and SJIS-2004. SJIS-2004 happens to win out by a tiny margin, and we don't get the answer which was desired.

In ISO-8859-1 the string decodes to:

Iksiñski

And in SJIS-2004:

Iksi卧ki

It may look obvious that we wanted ñs and not 卧, but the current implementation of mb_detect_encoding is based on inspecting codepoints one by one and seeing how many codepoints there are which are 'rare' across all of the world's most common languages. "ñ" and "s" are not rare (of course), and "卧" is also a fairly common word in Chinese. So mb_detect_encoding can't see any difference between the two decodings as far as rare codepoints go.

It also applies a small penalty to longer strings, which is necessary to avoid having *everything* detected as a single-byte encoding where every possible byte value decodes to a codepoint which is not rare.

Since there are no 'rare' codepoints in either decoding, and using SJIS-2004 results in a slightly shorter output than ISO-8859-1, the function goes for SJIS-2004.
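The length difference is easy to verify from userland (assuming an mbstring build that supports SJIS-2004):

```php
<?php
$bytes = 'Iksi' . chr(241) . 'ski'; // 8 bytes, with 0xF1 in the middle

// ISO-8859-1 maps every byte to one codepoint: "Iksiñski", 8 characters.
echo mb_strlen($bytes, 'ISO-8859-1'), "\n"; // 8

// In SJIS-2004, 0xF1 opens a two-byte character (0xF1 0x73 decodes to 卧),
// so the same bytes come out as "Iksi卧ki", only 7 characters.
echo mb_strlen($bytes, 'SJIS-2004'), "\n"; // 7
```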

I'm trying to think of a way to tweak the heuristics to get the output which Alec wants on this string, *without* making detection accuracy worse on a bunch of other possible inputs. It's tricky. We can make it provide the desired answer on this particular example, but we may trash lots and lots of other equally realistic cases in the process.

I think the one thing we could do, which has not been done yet, is to look at *sequences* of codepoints and judge them as likely or unlikely, rather than single codepoints. That has the potential to significantly boost detection accuracy across the board, rather than on just one cherry-picked example.

Of course, doing more checks will make the function a bit slower, which is a concern. We want it to be as accurate as possible, but we also want it to be fast.

The bigger issue is where we would find the data to tell us which sequences of codepoints are likely and which are unlikely. It would require gathering a big corpus of text in various languages which we can analyze. And just a 'big' corpus doesn't guarantee that the results will be good; there has to be enough data, but it also has to be balanced, good-quality data.

Then once we have that big corpus and can measure the frequency of various sequences of codepoints, how much memory are we willing to give to the resulting tables? Right now I am using 8KB for a bit vector (1 bit for each Unicode codepoint from U+0000 to U+FFFF). It would definitely take more than that to get any useful results, but how much? I don't know. I anticipate that something like a Bloom filter would be used to avoid consuming massive gobs of memory.
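For anyone curious what that 8KB bit vector amounts to, here is a toy PHP version (the function names are invented for the sketch; the real table is a static array in C):

```php
<?php
// One bit per codepoint U+0000..U+FFFF: 65536 bits = 8192 bytes.
$bitvec = str_repeat("\x00", 8192);

// Mark a codepoint as 'common'.
function set_common(string &$vec, int $cp): void {
    $vec[$cp >> 3] = chr(ord($vec[$cp >> 3]) | (1 << ($cp & 7)));
}

// Test whether a codepoint was marked as 'common'.
function is_common(string $vec, int $cp): bool {
    return ((ord($vec[$cp >> 3]) >> ($cp & 7)) & 1) === 1;
}

set_common($bitvec, 0x0144);          // e.g. mark ń as common
var_dump(is_common($bitvec, 0x0144)); // bool(true)
var_dump(is_common($bitvec, 0xE000)); // bool(false)
```

Sequence statistics would need considerably more than one bit per entry, which is why something probabilistic like a Bloom filter becomes attractive.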

Maybe rather than looking at sequences of codepoints, we would look at sequences of codepoint 'types': "A Latin character, followed by punctuation, followed by whitespace, followed by another Latin character..."

Not sure how much that would actually help us to boost accuracy. It would definitely reduce the size of the needed corpus.

Anyways, if Nikita or someone else has smarter ideas than me, I would love to hear them. Or if someone wants to help putting a good corpus together, I would be willing to write the code to use it, but gathering the corpus is more work than I am ready to do now.

Thoughts?
 [2021-11-23 07:48 UTC] alec at alec dot pl
So, this is what I expected. Short input can produce unexpected results. I'm guessing that in earlier versions there was no such thing as demerits (or the algorithm differed even more), so the provided priority list had more impact on the result. Is that right?

I'm not sure we should consider this last case a bug anymore.

I'm not that good with the subject, but maybe you can get some ideas from https://github.com/Joungkyun/libchardet or https://github.com/CLD2Owners/cld2

ps. the input is proper iso-8859-2 string, so why such a big difference between these?:
Score for ISO-8859-2: 0 illegal chars, 37 demerits
Score for ISO-8859-1: 0 illegal chars, 8 demerits
 [2021-11-24 20:05 UTC] alexdowad@php.net
In answer to Alec:

> I'm guessing that in earlier versions there was no such thing as demerits (or the algo was even more different), so the provided priority list had more impact on the result. Is that right?

Yep. The earlier versions of mb_detect_encoding only checked which encodings the input string was valid in, and picked the first one on the list. If the string was valid in more than one encoding, it did not do anything at all to try to figure out which one was most likely.

It also was not able to detect every text encoding supported by mbstring, only some of them.
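In other words, the old behavior was roughly equivalent to this (a simplification, not the actual C implementation, and ignoring that the old code only supported a subset of encodings):

```php
<?php
// Approximation of the pre-8.1 behaviour: return the first candidate
// the string validates against, with no likelihood scoring at all.
function detect_first_valid(string $s, array $candidates): ?string {
    foreach ($candidates as $enc) {
        if (mb_check_encoding($s, $enc)) {
            return $enc;
        }
    }
    return null;
}

// Every byte is valid ISO-8859-1, so list order completely decides.
echo detect_first_valid('test:test', ['ISO-8859-1', 'UTF-8']); // ISO-8859-1
```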

> I'm not that good with the subject, but maybe you can get some ideas from https://github.com/Joungkyun/libchardet or https://github.com/CLD2Owners/cld2

Thanks for the references! I briefly browsed a bit of the code for libchardet. It looks like they also rely heavily on character frequency tables... but instead of having just one table of 'common' and 'rare' characters, like mbstring, they have tables *for each supported text encoding*. So they are able to say that "this codepoint rarely appears in EUC-JP encoded text", or "this codepoint often occurs in UTF-16 encoded text".

It looks like they have some other tricks as well, though I didn't read the code in enough detail to figure them all out.

One other thing here. It looks like there is already a PHP extension for libchardet. Therefore, I don't think there is much need to compete with them; if people want more sophisticated charset detection in PHP, they can just use libchardet.

Still, if there is anything we can do with a reasonable amount of effort to improve detection accuracy across the board, I am open to ideas. If someone is willing to pitch in and help with some of the work, that would be great.

You are also very right that misdetection is much more likely on short strings than on long ones.

> ps. the input is proper iso-8859-2 string, so why such a big difference between these?:

You can see the ranges of codepoints which are currently considered 'common' by mbstring here:

https://github.com/php/php-src/blob/master/ext/mbstring/common_codepoints.txt

You will notice that U+0144 (ń) is not there. Actually, I haven't included anything from the "Latin Extended-A" range. You can see all the characters in this range here:

https://www.unicode.org/charts/PDF/U0100.pdf

If you think any of these should be considered 'common', please let us know and we can tweak the table accordingly.
 [2021-11-25 06:32 UTC] alec at alec dot pl
Well, let's take Polish language (which is ISO-8859-2 family) as an example. The one I know.

There's a table at https://pl.wikipedia.org/wiki/Alfabet_polski#Cz%C4%99sto%C5%9B%C4%87_wyst%C4%99powania_liter (only in the Polish version). According to this, some diacritical characters are quite common (~1-2%). E.g. the letter Ł is more common than the letter B.

I have no idea how that applies to the whole character set detection, though.
 [2021-11-25 08:42 UTC] alexdowad@php.net
Thanks, Alec.

Hopefully this will be merged soon:

https://github.com/php/php-src/pull/7659/commits/889d2fe3bafdb8e9da39b2297d6e9f13a518584e
 