[2008-09-04 11:47 UTC] mtrojan at transline dot de
Description:
------------
mb_detect_encoding does not seem to recognize UTF-16 encoded files properly. Even when mb_check_encoding confirms that a file is valid UTF-16LE, mb_detect_encoding does not report that file as UTF-16 and returns ISO-8859-1 instead. Enabling or disabling strict mode has no effect on the result.
Reproduce code:
---------------
$content = file_get_contents($src_path);
$encodings = array('UTF-16', 'UTF-16LE', 'UTF-16BE', 'UTF-8', 'UNICODE', 'ISO-8859-1');
$enc = mb_detect_encoding($content, $encodings);
print "encoding: $enc\n";
print 'checked: ' . intval(mb_check_encoding($content, 'UTF-16LE'));
Expected result:
----------------
encoding: UTF-16LE
checked: 1
Actual result:
--------------
encoding: ISO-8859-1
checked: 1
mb_detect_encoding does not support detecting the UTF-16/UTF-16BE encodings. Because UTF-16 is not a byte-stream encoding like UTF-8, it cannot be detected the way other byte-stream encodings are. A file encoded in UTF-16 can be detected easily from its BOM, like this:
if ($content[0] == chr(0xff) && $content[1] == chr(0xfe)) {
    echo 'UTF-16';
} else if ($content[0] == chr(0xfe) && $content[1] == chr(0xff)) {
    echo 'UTF-16BE';
}

Bug with correct encoding detection.
function detect_enc($str) {
    $awe = mb_list_encodings();
    // Skip the first three entries of the encoding list.
    unset($awe[0], $awe[1], $awe[2]);
    foreach ($awe as $enctype) {
        if (mb_check_encoding($str, $enctype) === true)
            return $enctype;
    }
    return false;
}
echo detect_enc('String_encoded_to_Windows-1251'); // Returns 'byte2be', which is wrong.

So there seems to be a regression here between 5.5 and 5.6. I have a unit test for a project. It takes a UTF-16 encoded file (with BOM), copies it to a tmp dir, then detects the encoding and calls mb_convert_encoding($fileContent, 'UTF-8') on the file. On PHP 5.5 the file is converted to UTF-8 properly. On 5.6 (on Linux and Windows) and 7 (Linux) it fails: the BOM becomes ?? and the file is then detected as ASCII. Using
if ($content[0] == chr(0xff) && $content[1] == chr(0xfe)) {
    echo 'UTF-16';
}
does detect it as UTF-16, but I'd like to be able to detect the files that are multi-byte and convert them to UTF-8.
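
The BOM check suggested above can be combined with mb_convert_encoding to cover the last request. The following is only a minimal sketch, assuming PHP 7.1 or later with the mbstring extension loaded. The function names detect_bom_encoding and to_utf8 and the fallback candidate list are hypothetical and are not part of mbstring. The sketch cannot recognize UTF-16 data that has no BOM.

```php
<?php
// Hypothetical helper (not an mbstring function): map a leading BOM to an encoding name.
function detect_bom_encoding(string $bytes): ?string
{
    $boms = [
        'UTF-8'    => "\xEF\xBB\xBF",
        'UTF-16BE' => "\xFE\xFF",
        'UTF-16LE' => "\xFF\xFE",
    ];
    foreach ($boms as $encoding => $bom) {
        // strncmp is binary safe, so comparing raw bytes is fine here.
        if (strncmp($bytes, $bom, strlen($bom)) === 0) {
            return $encoding;
        }
    }
    return null; // No BOM found.
}

// Hypothetical helper: convert a byte string to UTF-8.
// With a BOM, the BOM decides the source encoding. Without one, fall back to
// mb_detect_encoding over byte-oriented candidates only (an assumed list).
function to_utf8(string $bytes, array $fallbacks = ['UTF-8', 'ISO-8859-1']): string
{
    $encoding = detect_bom_encoding($bytes);
    if ($encoding !== null) {
        // Strip the BOM, then convert using the explicit encoding name.
        $bomLength = ($encoding === 'UTF-8') ? 3 : 2;
        return mb_convert_encoding(substr($bytes, $bomLength), 'UTF-8', $encoding);
    }

    $guess = mb_detect_encoding($bytes, $fallbacks, true);
    if ($guess === false) {
        throw new RuntimeException('Could not determine the source encoding.');
    }
    return mb_convert_encoding($bytes, 'UTF-8', $guess);
}

// Usage: "hi" encoded as UTF-16LE with a BOM.
echo to_utf8("\xFF\xFE" . "h\x00i\x00"), "\n"; // prints: hi
```

For a file, the same call would be to_utf8(file_get_contents($src_path)), with $src_path as in the reproduce code. Passing the encoding taken from the BOM as the third argument avoids relying on mb_detect_encoding for UTF-16 at all.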