[2021-08-26 17:03 UTC] alec at alec dot pl
Description:
------------
The same code returns "ISO-8859-1" on PHP 8.0 and "UUENCODE" on PHP 8.1.0beta3.
Note: the text contains ASCII with two 0xEB characters.
Note: ISO-8859-1 appears before UUENCODE in the mb_list_encodings() result.
Note: even removing UUENCODE from the list does not make it return the expected ISO-8859-1.
Test script:
---------------
$test = base64_decode('Q0hBUlNFVD13aW5kb3dzLTEyNTI6RG/rO0pvaG4=');
echo mb_detect_encoding($test, mb_list_encodings());
Expected result:
----------------
ISO-8859-1
Actual result:
--------------
UUENCODE
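For reference, the base64 payload in the test script can be inspected directly. This sketch (not part of the original report) dumps the raw bytes and shows that decoding them as ISO-8859-1 yields readable Latin text, since 0xEB is "ë" in both ISO-8859-1 and Windows-1252:

```php
<?php
// Decode the payload from the test script and inspect the raw bytes.
$test = base64_decode('Q0hBUlNFVD13aW5kb3dzLTEyNTI6RG/rO0pvaG4=');

// Show the bytes as hex; the 0xEB byte sits between ASCII characters.
echo bin2hex($test), "\n";
// prints "434841525345543d77696e646f77732d313235323a446feb3b4a6f686e"

// Decoding the same bytes as ISO-8859-1 gives plain Latin text.
echo mb_convert_encoding($test, 'UTF-8', 'ISO-8859-1'), "\n";
// prints "CHARSET=windows-1252:Doë;John"
```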
I guess I'll use a sane list of encodings; however, there is still something wrong:

```php
$test = 'test:test';
$encodings = ['UTF-8', 'SJIS', 'GB2312', 'ISO-8859-1', 'ISO-8859-2', 'ISO-8859-3',
    'ISO-8859-4', 'ISO-8859-5', 'ISO-8859-6', 'ISO-8859-7', 'ISO-8859-8',
    'ISO-8859-9', 'ISO-8859-10', 'ISO-8859-13', 'ISO-8859-14', 'ISO-8859-15',
    'ISO-8859-16', 'WINDOWS-1252', 'WINDOWS-1251', 'EUC-JP', 'EUC-TW', 'KOI8-R',
    'BIG-5', 'ISO-2022-KR', 'ISO-2022-JP', 'UTF-16'];
echo mb_detect_encoding($test, $encodings);
```

This returns "UTF-16". Maybe that's one of the issues you described already.

Patrick Allaert reached out to me today to see whether the latest report from Alec could be looked into before he cuts the final release for 8.1.0. Thanks very much for the heads-up, Patrick! Gladly!

I added the following line to `mbfl_encoding_detector_judge` in mbfilter.c and recompiled:

```c
printf("Score for %s: %d illegal chars, %d demerits\n",
       filter->from->name, data->num_illegalchars, data->score);
```

Then I ran Alec's new test case.
Output:

```
Score for UTF-8: 1 illegal chars, 5 demerits
Score for ISO-8859-1: 0 illegal chars, 8 demerits
Score for ISO-8859-2: 0 illegal chars, 37 demerits
Score for ISO-8859-3: 0 illegal chars, 8 demerits
Score for ISO-8859-4: 0 illegal chars, 37 demerits
Score for ISO-8859-5: 0 illegal chars, 8 demerits
Score for ISO-8859-6: 0 illegal chars, 8 demerits
Score for ISO-8859-7: 0 illegal chars, 8 demerits
Score for ISO-8859-8: 0 illegal chars, 8 demerits
Score for ISO-8859-9: 0 illegal chars, 8 demerits
Score for ISO-8859-10: 0 illegal chars, 37 demerits
Score for ISO-8859-13: 0 illegal chars, 37 demerits
Score for ISO-8859-14: 0 illegal chars, 8 demerits
Score for ISO-8859-15: 0 illegal chars, 8 demerits
Score for ISO-8859-16: 0 illegal chars, 37 demerits
Score for Windows-1252: 0 illegal chars, 8 demerits
Score for Windows-1251: 0 illegal chars, 8 demerits
Score for Windows-1254: 0 illegal chars, 8 demerits
Score for EUC-JP: 1 illegal chars, 4 demerits
Score for EUC-TW: 1 illegal chars, 4 demerits
Score for KOI8-R: 0 illegal chars, 8 demerits
Score for BIG-5: 0 illegal chars, 36 demerits
Score for ISO-2022-KR: 1 illegal chars, 4 demerits
Score for ISO-2022-JP: 1 illegal chars, 4 demerits
Score for GB18030: 0 illegal chars, 36 demerits
Score for UTF-32: 1 illegal chars, 0 demerits
Score for UTF-32BE: 1 illegal chars, 0 demerits
Score for UTF-32LE: 1 illegal chars, 0 demerits
Score for UTF-16: 0 illegal chars, 91 demerits
Score for UTF-16BE: 0 illegal chars, 91 demerits
Score for UTF-16LE: 0 illegal chars, 91 demerits
Score for UTF-7: 1 illegal chars, 4 demerits
Score for UTF7-IMAP: 1 illegal chars, 4 demerits
Score for ASCII: 1 illegal chars, 4 demerits
Score for SJIS: 1 illegal chars, 4 demerits
Score for eucJP-win: 1 illegal chars, 4 demerits
Score for EUC-JP-2004: 1 illegal chars, 4 demerits
Score for SJIS-Mobile#DOCOMO: 0 illegal chars, 36 demerits
Score for SJIS-Mobile#KDDI: 0 illegal chars, 36 demerits
Score for SJIS-Mobile#SOFTBANK: 0 illegal chars, 36 demerits
Score for SJIS-mac: 1 illegal chars, 4 demerits
Score for SJIS-2004: 0 illegal chars, 7 demerits
Score for UTF-8-Mobile#DOCOMO: 1 illegal chars, 5 demerits
Score for UTF-8-Mobile#KDDI-A: 1 illegal chars, 5 demerits
Score for UTF-8-Mobile#KDDI-B: 1 illegal chars, 5 demerits
Score for UTF-8-Mobile#SOFTBANK: 1 illegal chars, 5 demerits
Score for CP932: 0 illegal chars, 36 demerits
Score for CP51932: 1 illegal chars, 4 demerits
Score for JIS: 1 illegal chars, 4 demerits
Score for ISO-2022-JP-MS: 1 illegal chars, 4 demerits
Score for Windows-1252: 0 illegal chars, 8 demerits
Score for Windows-1254: 0 illegal chars, 8 demerits
Score for EUC-CN: 1 illegal chars, 4 demerits
Score for CP936: 0 illegal chars, 36 demerits
Score for HZ: 1 illegal chars, 4 demerits
Score for CP950: 0 illegal chars, 36 demerits
Score for EUC-KR: 1 illegal chars, 4 demerits
Score for UHC: 1 illegal chars, 4 demerits
Score for Windows-1251: 0 illegal chars, 8 demerits
Score for CP866: 0 illegal chars, 8 demerits
Score for KOI8-U: 0 illegal chars, 8 demerits
Score for ArmSCII-8: 0 illegal chars, 37 demerits
Score for CP850: 0 illegal chars, 8 demerits
Score for ISO-2022-JP-2004: 1 illegal chars, 4 demerits
Score for ISO-2022-JP-MOBILE#KDDI: 1 illegal chars, 4 demerits
Score for CP50220: 1 illegal chars, 4 demerits
Score for CP50221: 1 illegal chars, 4 demerits
Score for CP50222: 1 illegal chars, 4 demerits
SJIS-2004
```

Key lines are:

```
Score for ISO-8859-1: 0 illegal chars, 8 demerits
Score for SJIS-2004: 0 illegal chars, 7 demerits
```

So what we have here is a case where the heuristics employed by mb_detect_encoding are not strong enough to detect a significant difference in likelihood between ISO-8859-1 and SJIS-2004. SJIS-2004 happens to win out by a tiny margin, and we don't get the answer which was desired.
In ISO-8859-1 the string decodes to: Iksiñski

And in SJIS-2004: Iksi卧ki

It may look obvious that we wanted "ñs" and not "卧", but the current implementation of mb_detect_encoding inspects codepoints one by one and counts how many of them are 'rare' across all of the world's most common languages. "ñ" and "s" are not rare (of course), and "卧" is also a fairly common word in Chinese, so mb_detect_encoding sees no difference between the two decodings as far as rare codepoints go. It also applies a small penalty to longer decoded outputs, which is necessary to avoid having *everything* detected as a single-byte encoding in which every possible byte value decodes to a non-rare codepoint. Since there are no 'rare' codepoints in either decoding, and SJIS-2004 yields a slightly shorter output than ISO-8859-1, the function goes for SJIS-2004.

I'm trying to think of a way to tweak the heuristics to get the output which Alec wants on this string, *without* making detection accuracy worse on a bunch of other possible inputs. It's tricky: we can make it provide the desired answer on this particular example, but we may trash lots and lots of other equally realistic cases in the process. I think the one thing we could do, which has not been done yet, is to look at *sequences* of codepoints and judge them as likely or unlikely, rather than single codepoints. That has the potential to significantly boost detection accuracy across the board, rather than on just one cherry-picked example. Of course, doing more checks will make the function a bit slower, which is a concern: we want it to be as accurate as possible, but we also want it to be fast. The bigger issue is where we would find the data to tell us which sequences of codepoints are likely and which are unlikely. It would require gathering a big corpus of text in various languages which we can analyze.
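The ambiguity can be seen by converting the same byte string ("Iksi" followed by an 0xF1 byte and "ski") under each candidate encoding. A small sketch, not from the original thread:

```php
<?php
// The ambiguous byte string: ASCII letters with a single 0xF1 byte.
$bytes = "Iksi\xF1ski";

// As ISO-8859-1, 0xF1 is U+00F1 "ñ", giving the intended surname.
echo mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1'), "\n";
// prints "Iksiñski"

// As SJIS-2004, 0xF1 is a lead byte that pairs with the following
// "s" (0x73) to form a two-byte kanji, giving a shorter decoding.
echo mb_convert_encoding($bytes, 'UTF-8', 'SJIS-2004'), "\n";
// prints "Iksi卧ki"
```

Both decodings are "valid", which is exactly why the per-codepoint rarity heuristic cannot separate them.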
And just a 'big' corpus doesn't guarantee that the results will be good; there has to be enough data, but it also has to be balanced, good-quality data. Then, once we have that big corpus and can measure the frequency of various sequences of codepoints, how much memory are we willing to give to the resulting tables? Right now I am using 8KB for a bit vector (1 bit for each Unicode codepoint from U+0000 to U+FFFF). It would definitely take more than that to get any useful results, but how much? I don't know. I anticipate that something like a Bloom filter would be needed to avoid consuming massive gobs of memory.

Maybe rather than looking at sequences of codepoints, we could look at sequences of codepoint 'types': "a Latin character, followed by punctuation, followed by whitespace, followed by another Latin character..." I'm not sure how much that would actually help to boost accuracy, but it would definitely reduce the size of the needed corpus.

Anyway, if Nikita or someone else has smarter ideas than mine, I would love to hear them. Or if someone wants to help put a good corpus together, I would be willing to write the code to use it, but gathering the corpus is more work than I am ready to do now. Thoughts?

Hi, Alec. OK, I tried converting the text to ISO-8859-2. PHP 7 returns "ISO-8859-1"; my new code for encoding detection returns "Windows-1252". Why the difference? I did a bit of analysis and found that PHP 7 was not able to auto-detect Windows-1252 at all, while PHP 8.1 is. However, the heuristics which it uses cannot tell any difference between ISO-8859-{1,2} and Windows-1252 in this case, so it returns the one which appears earlier in the list. In this case, you put Windows-1252 earlier.
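The "sequences of codepoint types" idea might look roughly like this toy sketch. Everything here is hypothetical for illustration: the class set, the transition table, and the demerit weights are made up and are not taken from mbstring:

```php
<?php
// Toy sketch: score text by bigrams of codepoint *classes* (hypothetical).
function cp_class(int $cp): string {
    if ($cp === 0x20 || $cp === 0x09) return 'space';
    if ($cp >= 0x41 && $cp <= 0x7A) return 'latin';  // crude ASCII-letter range
    if ($cp >= 0x4E00 && $cp <= 0x9FFF) return 'cjk';
    return 'other';
}

// Demerits for class transitions; a lone CJK ideograph wedged between
// Latin letters is penalized heavily (made-up weights).
const TRANSITION_DEMERITS = [
    'latin,latin' => 0,
    'cjk,cjk'     => 0,
    'latin,cjk'   => 10,
    'cjk,latin'   => 10,
];

function sequence_demerits(array $codepoints): int {
    $score = 0;
    for ($i = 1; $i < count($codepoints); $i++) {
        $key = cp_class($codepoints[$i - 1]) . ',' . cp_class($codepoints[$i]);
        $score += TRANSITION_DEMERITS[$key] ?? 1; // small default penalty
    }
    return $score;
}

// "Iksiñski" (all Latin-ish) vs "Iksi卧ki" (Latin-CJK-Latin).
$latin = [0x49, 0x6B, 0x73, 0x69, 0xF1, 0x73, 0x6B, 0x69];
$mixed = [0x49, 0x6B, 0x73, 0x69, 0x5367, 0x6B, 0x69];
echo sequence_demerits($latin), ' vs ', sequence_demerits($mixed), "\n";
// prints "2 vs 20"
```

Under these invented weights the ISO-8859-1 decoding would win comfortably; the hard part, as noted above, is deriving realistic weights from a balanced corpus rather than inventing them.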
Let me know if you have any comments.

This issue is still present in every 8.1 and 8.2 version. Here is a snippet demonstrating the difference in behavior between those versions:

```php
# ensure that a string is UTF-8
function fvm_ensure_utf8($str) {
    $enc = mb_detect_encoding($str, mb_list_encodings(), true);
    var_dump($enc);
    if ($enc === false) {
        return false; // could not detect encoding
    } elseif ($enc !== "UTF-8") {
        return mb_convert_encoding($str, "UTF-8", $enc); // convert to UTF-8
    }
    return $str; // already UTF-8
}

$css = 'input[type="radio"]:checked + img { border: 5px solid #0083ca; }';
$css = fvm_ensure_utf8($css);
echo $css;
```

You can run it on phpsandbox with versions 8.0, 8.1, and 8.2: 8.0 detects "UTF-8", while 8.1+ detects "UTF-7", and the subsequent mb_convert_encoding drops the "+" from the string (in UTF-7, "+" introduces a Base64-encoded run, so it is consumed during conversion).
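A common mitigation (a sketch, not an official fix from this thread) is to validate UTF-8 explicitly with mb_check_encoding() before falling back to detection, and to keep the fallback candidate list short and deliberately ordered. The candidate list below is an assumption about what a typical application expects, not something prescribed by mbstring:

```php
<?php
// Sketch: prefer an explicit UTF-8 validity check over broad detection.
function ensure_utf8_safe(string $str) {
    // mb_check_encoding() validates without guessing, so plain ASCII/UTF-8
    // input is never misidentified as UTF-7 or another exotic encoding.
    if (mb_check_encoding($str, 'UTF-8')) {
        return $str;
    }
    // Fall back to a short, ordered candidate list (assumed, not exhaustive).
    $enc = mb_detect_encoding($str, ['UTF-8', 'ISO-8859-1', 'Windows-1252'], true);
    if ($enc === false) {
        return false; // could not detect encoding
    }
    return mb_convert_encoding($str, 'UTF-8', $enc);
}

$css = 'input[type="radio"]:checked + img { border: 5px solid #0083ca; }';
var_dump(ensure_utf8_safe($css) === $css); // the "+" survives untouched
```

Because valid UTF-8 short-circuits before mb_detect_encoding() is ever called, this sidesteps the UTF-7 misdetection regardless of PHP version.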