[2021-08-26 17:03 UTC] alec at alec dot pl
Description:
------------
The same code returns "ISO-8859-1" on PHP 8.0 and "UUENCODE" on PHP 8.1.0beta3.
Note: the text contains ASCII with two 0xEB characters.
Note: ISO-8859-1 appears before UUENCODE in the mb_list_encodings() result.
Note: even removing UUENCODE from the list does not make it return the expected ISO-8859-1.
Test script:
---------------
$test = base64_decode('Q0hBUlNFVD13aW5kb3dzLTEyNTI6RG/rO0pvaG4=');
echo mb_detect_encoding($test, mb_list_encodings());
Expected result:
----------------
ISO-8859-1
Actual result:
--------------
UUENCODE
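For reference, the base64 payload in the test script can be inspected directly. This sketch (not part of the original report) dumps the raw bytes and shows that decoding them as ISO-8859-1 yields readable Latin text, since 0xEB is "ë" in both ISO-8859-1 and Windows-1252:

```php
<?php
// Decode the payload from the test script and inspect the raw bytes.
$test = base64_decode('Q0hBUlNFVD13aW5kb3dzLTEyNTI6RG/rO0pvaG4=');

// Show the bytes as hex; the 0xEB byte sits between ASCII characters.
echo bin2hex($test), "\n";
// prints "434841525345543d77696e646f77732d313235323a446feb3b4a6f686e"

// Decoding the same bytes as ISO-8859-1 gives plain Latin text.
echo mb_convert_encoding($test, 'UTF-8', 'ISO-8859-1'), "\n";
// prints "CHARSET=windows-1252:Doë;John"
```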
I guess I'll use a sane list of encodings; however, there is still something wrong:

```php
$test = 'test:test';
$encodings = ['UTF-8', 'SJIS', 'GB2312', 'ISO-8859-1', 'ISO-8859-2', 'ISO-8859-3',
    'ISO-8859-4', 'ISO-8859-5', 'ISO-8859-6', 'ISO-8859-7', 'ISO-8859-8',
    'ISO-8859-9', 'ISO-8859-10', 'ISO-8859-13', 'ISO-8859-14', 'ISO-8859-15',
    'ISO-8859-16', 'WINDOWS-1252', 'WINDOWS-1251', 'EUC-JP', 'EUC-TW', 'KOI8-R',
    'BIG-5', 'ISO-2022-KR', 'ISO-2022-JP', 'UTF-16'];
echo mb_detect_encoding($test, $encodings);
```

This returns "UTF-16". Maybe that's one of the issues you described already.

Patrick Allaert reached out to me today to see whether the latest report from Alec could be looked into before he cuts the final release for 8.1.0. Thanks very much for the heads-up, Patrick! Gladly!

I added the following line to `mbfl_encoding_detector_judge` in mbfilter.c and recompiled:

```c
printf("Score for %s: %d illegal chars, %d demerits\n",
       filter->from->name, data->num_illegalchars, data->score);
```

Then I ran Alec's new test case.
Output:

```
Score for UTF-8: 1 illegal chars, 5 demerits
Score for ISO-8859-1: 0 illegal chars, 8 demerits
Score for ISO-8859-2: 0 illegal chars, 37 demerits
Score for ISO-8859-3: 0 illegal chars, 8 demerits
Score for ISO-8859-4: 0 illegal chars, 37 demerits
Score for ISO-8859-5: 0 illegal chars, 8 demerits
Score for ISO-8859-6: 0 illegal chars, 8 demerits
Score for ISO-8859-7: 0 illegal chars, 8 demerits
Score for ISO-8859-8: 0 illegal chars, 8 demerits
Score for ISO-8859-9: 0 illegal chars, 8 demerits
Score for ISO-8859-10: 0 illegal chars, 37 demerits
Score for ISO-8859-13: 0 illegal chars, 37 demerits
Score for ISO-8859-14: 0 illegal chars, 8 demerits
Score for ISO-8859-15: 0 illegal chars, 8 demerits
Score for ISO-8859-16: 0 illegal chars, 37 demerits
Score for Windows-1252: 0 illegal chars, 8 demerits
Score for Windows-1251: 0 illegal chars, 8 demerits
Score for Windows-1254: 0 illegal chars, 8 demerits
Score for EUC-JP: 1 illegal chars, 4 demerits
Score for EUC-TW: 1 illegal chars, 4 demerits
Score for KOI8-R: 0 illegal chars, 8 demerits
Score for BIG-5: 0 illegal chars, 36 demerits
Score for ISO-2022-KR: 1 illegal chars, 4 demerits
Score for ISO-2022-JP: 1 illegal chars, 4 demerits
Score for GB18030: 0 illegal chars, 36 demerits
Score for UTF-32: 1 illegal chars, 0 demerits
Score for UTF-32BE: 1 illegal chars, 0 demerits
Score for UTF-32LE: 1 illegal chars, 0 demerits
Score for UTF-16: 0 illegal chars, 91 demerits
Score for UTF-16BE: 0 illegal chars, 91 demerits
Score for UTF-16LE: 0 illegal chars, 91 demerits
Score for UTF-7: 1 illegal chars, 4 demerits
Score for UTF7-IMAP: 1 illegal chars, 4 demerits
Score for ASCII: 1 illegal chars, 4 demerits
Score for SJIS: 1 illegal chars, 4 demerits
Score for eucJP-win: 1 illegal chars, 4 demerits
Score for EUC-JP-2004: 1 illegal chars, 4 demerits
Score for SJIS-Mobile#DOCOMO: 0 illegal chars, 36 demerits
Score for SJIS-Mobile#KDDI: 0 illegal chars, 36 demerits
Score for SJIS-Mobile#SOFTBANK: 0 illegal chars, 36 demerits
Score for SJIS-mac: 1 illegal chars, 4 demerits
Score for SJIS-2004: 0 illegal chars, 7 demerits
Score for UTF-8-Mobile#DOCOMO: 1 illegal chars, 5 demerits
Score for UTF-8-Mobile#KDDI-A: 1 illegal chars, 5 demerits
Score for UTF-8-Mobile#KDDI-B: 1 illegal chars, 5 demerits
Score for UTF-8-Mobile#SOFTBANK: 1 illegal chars, 5 demerits
Score for CP932: 0 illegal chars, 36 demerits
Score for CP51932: 1 illegal chars, 4 demerits
Score for JIS: 1 illegal chars, 4 demerits
Score for ISO-2022-JP-MS: 1 illegal chars, 4 demerits
Score for Windows-1252: 0 illegal chars, 8 demerits
Score for Windows-1254: 0 illegal chars, 8 demerits
Score for EUC-CN: 1 illegal chars, 4 demerits
Score for CP936: 0 illegal chars, 36 demerits
Score for HZ: 1 illegal chars, 4 demerits
Score for CP950: 0 illegal chars, 36 demerits
Score for EUC-KR: 1 illegal chars, 4 demerits
Score for UHC: 1 illegal chars, 4 demerits
Score for Windows-1251: 0 illegal chars, 8 demerits
Score for CP866: 0 illegal chars, 8 demerits
Score for KOI8-U: 0 illegal chars, 8 demerits
Score for ArmSCII-8: 0 illegal chars, 37 demerits
Score for CP850: 0 illegal chars, 8 demerits
Score for ISO-2022-JP-2004: 1 illegal chars, 4 demerits
Score for ISO-2022-JP-MOBILE#KDDI: 1 illegal chars, 4 demerits
Score for CP50220: 1 illegal chars, 4 demerits
Score for CP50221: 1 illegal chars, 4 demerits
Score for CP50222: 1 illegal chars, 4 demerits
SJIS-2004
```

Key lines are:

```
Score for ISO-8859-1: 0 illegal chars, 8 demerits
Score for SJIS-2004: 0 illegal chars, 7 demerits
```

So what we have here is a case where the heuristics employed by mb_detect_encoding are not strong enough to detect a significant difference in likelihood between ISO-8859-1 and SJIS-2004. SJIS-2004 happens to win out by a tiny margin, and we don't get the answer which was desired.
In ISO-8859-1 the string decodes to: Iksiñski

And in SJIS-2004: Iksi卧ki

It may look obvious that we wanted "ñs" and not "卧", but the current implementation of mb_detect_encoding inspects codepoints one by one and counts how many of them are 'rare' across all of the world's most common languages. "ñ" and "s" are not rare (of course), and "卧" is also a fairly common word in Chinese, so mb_detect_encoding sees no difference between the two decodings as far as rare codepoints go. It also applies a small penalty to longer decoded outputs, which is necessary to avoid having *everything* detected as a single-byte encoding in which every possible byte value decodes to a non-rare codepoint. Since there are no 'rare' codepoints in either decoding, and SJIS-2004 yields a slightly shorter output than ISO-8859-1, the function goes for SJIS-2004.

I'm trying to think of a way to tweak the heuristics to get the output which Alec wants on this string, *without* making detection accuracy worse on a bunch of other possible inputs. It's tricky: we can make it provide the desired answer on this particular example, but we may trash lots and lots of other equally realistic cases in the process. I think the one thing we could do, which has not been done yet, is to look at *sequences* of codepoints and judge them as likely or unlikely, rather than single codepoints. That has the potential to significantly boost detection accuracy across the board, rather than on just one cherry-picked example. Of course, doing more checks will make the function a bit slower, which is a concern: we want it to be as accurate as possible, but we also want it to be fast. The bigger issue is where we would find the data to tell us which sequences of codepoints are likely and which are unlikely. It would require gathering a big corpus of text in various languages which we can analyze.
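The ambiguity can be seen by converting the same byte string ("Iksi" followed by an 0xF1 byte and "ski") under each candidate encoding. A small sketch, not from the original thread:

```php
<?php
// The ambiguous byte string: ASCII letters with a single 0xF1 byte.
$bytes = "Iksi\xF1ski";

// As ISO-8859-1, 0xF1 is U+00F1 "ñ", giving the intended surname.
echo mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1'), "\n";
// prints "Iksiñski"

// As SJIS-2004, 0xF1 is a lead byte that pairs with the following
// "s" (0x73) to form a two-byte kanji, giving a shorter decoding.
echo mb_convert_encoding($bytes, 'UTF-8', 'SJIS-2004'), "\n";
// prints "Iksi卧ki"
```

Both decodings are "valid", which is exactly why the per-codepoint rarity heuristic cannot separate them.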
And just a 'big' corpus doesn't guarantee that the results will be good; there has to be enough data, but it also has to be balanced, good-quality data. Then, once we have that big corpus and can measure the frequency of various sequences of codepoints, how much memory are we willing to give to the resulting tables? Right now I am using 8KB for a bit vector (1 bit for each Unicode codepoint from U+0000 to U+FFFF). It would definitely take more than that to get any useful results, but how much? I don't know. I anticipate that something like a Bloom filter would be needed to avoid consuming massive gobs of memory.

Maybe rather than looking at sequences of codepoints, we could look at sequences of codepoint 'types': "a Latin character, followed by punctuation, followed by whitespace, followed by another Latin character..." I'm not sure how much that would actually help to boost accuracy, but it would definitely reduce the size of the needed corpus.

Anyway, if Nikita or someone else has smarter ideas than mine, I would love to hear them. Or if someone wants to help put a good corpus together, I would be willing to write the code to use it, but gathering the corpus is more work than I am ready to do now. Thoughts?

Hi, Alec. OK, I tried converting the text to ISO-8859-2. PHP 7 returns "ISO-8859-1"; my new code for encoding detection returns "Windows-1252". Why the difference? I did a bit of analysis and found that PHP 7 was not able to auto-detect Windows-1252 at all, while PHP 8.1 is. However, the heuristics which it uses cannot tell any difference between ISO-8859-{1,2} and Windows-1252 in this case, so it returns the one which appears earlier in the list. In this case, you put Windows-1252 earlier.
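The "sequences of codepoint types" idea might look roughly like this toy sketch. Everything here is hypothetical for illustration: the class set, the transition table, and the demerit weights are made up and are not taken from mbstring:

```php
<?php
// Toy sketch: score text by bigrams of codepoint *classes* (hypothetical).
function cp_class(int $cp): string {
    if ($cp === 0x20 || $cp === 0x09) return 'space';
    if ($cp >= 0x41 && $cp <= 0x7A) return 'latin';  // crude ASCII-letter range
    if ($cp >= 0x4E00 && $cp <= 0x9FFF) return 'cjk';
    return 'other';
}

// Demerits for class transitions; a lone CJK ideograph wedged between
// Latin letters is penalized heavily (made-up weights).
const TRANSITION_DEMERITS = [
    'latin,latin' => 0,
    'cjk,cjk'     => 0,
    'latin,cjk'   => 10,
    'cjk,latin'   => 10,
];

function sequence_demerits(array $codepoints): int {
    $score = 0;
    for ($i = 1; $i < count($codepoints); $i++) {
        $key = cp_class($codepoints[$i - 1]) . ',' . cp_class($codepoints[$i]);
        $score += TRANSITION_DEMERITS[$key] ?? 1; // small default penalty
    }
    return $score;
}

// "Iksiñski" (all Latin-ish) vs "Iksi卧ki" (Latin-CJK-Latin).
$latin = [0x49, 0x6B, 0x73, 0x69, 0xF1, 0x73, 0x6B, 0x69];
$mixed = [0x49, 0x6B, 0x73, 0x69, 0x5367, 0x6B, 0x69];
echo sequence_demerits($latin), ' vs ', sequence_demerits($mixed), "\n";
// prints "2 vs 20"
```

Under these invented weights the ISO-8859-1 decoding would win comfortably; the hard part, as noted above, is deriving realistic weights from a balanced corpus rather than inventing them.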
Let me know if you have any comments.

This issue is still present in every 8.1 and 8.2 version. Here is a snippet demonstrating the difference in behavior between those versions:

```php
# ensure that a string is UTF-8
function fvm_ensure_utf8($str) {
    $enc = mb_detect_encoding($str, mb_list_encodings(), true);
    var_dump($enc);
    if ($enc === false) {
        return false; // could not detect encoding
    } elseif ($enc !== "UTF-8") {
        return mb_convert_encoding($str, "UTF-8", $enc); // convert to UTF-8
    }
    return $str; // already UTF-8
}

$css = 'input[type="radio"]:checked + img { border: 5px solid #0083ca; }';
$css = fvm_ensure_utf8($css);
echo $css;
```

You can run it on phpsandbox with versions 8.0, 8.1, and 8.2: 8.0 detects "UTF-8", while 8.1+ detects "UTF-7", and the subsequent mb_convert_encoding drops the "+" from the string (in UTF-7, "+" introduces a Base64-encoded run, so it is consumed during conversion).
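A common mitigation (a sketch, not an official fix from this thread) is to validate UTF-8 explicitly with mb_check_encoding() before falling back to detection, and to keep the fallback candidate list short and deliberately ordered. The candidate list below is an assumption about what a typical application expects, not something prescribed by mbstring:

```php
<?php
// Sketch: prefer an explicit UTF-8 validity check over broad detection.
function ensure_utf8_safe(string $str) {
    // mb_check_encoding() validates without guessing, so plain ASCII/UTF-8
    // input is never misidentified as UTF-7 or another exotic encoding.
    if (mb_check_encoding($str, 'UTF-8')) {
        return $str;
    }
    // Fall back to a short, ordered candidate list (assumed, not exhaustive).
    $enc = mb_detect_encoding($str, ['UTF-8', 'ISO-8859-1', 'Windows-1252'], true);
    if ($enc === false) {
        return false; // could not detect encoding
    }
    return mb_convert_encoding($str, 'UTF-8', $enc);
}

$css = 'input[type="radio"]:checked + img { border: 5px solid #0083ca; }';
var_dump(ensure_utf8_safe($css) === $css); // the "+" survives untouched
```

Because valid UTF-8 short-circuits before mb_detect_encoding() is ever called, this sidesteps the UTF-7 misdetection regardless of PHP version.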