Request #45993 mb_detect_encoding should support UTF-16
Submitted: 2008-09-04 11:47 UTC Modified: 2016-07-31 13:56 UTC
Votes:18
Avg. Score:4.3 ± 0.8
Reproduced:18 of 18 (100.0%)
Same Version:3 (16.7%)
Same OS:2 (11.1%)
From: mtrojan at transline dot de Assigned:
Status: Open Package: mbstring related
PHP Version: 5.2.6 OS: Windows XP
Private report: No CVE-ID: None
 [2008-09-04 11:47 UTC] mtrojan at transline dot de
Description:
------------
mb_detect_encoding does not seem to recognize UTF-16 encoded files properly. Even if it is assured by using mb_check_encoding that a file is truly UTF-16LE, mb_detect_encoding does not detect the same file as UTF-16 and is returning ISO-8859-1 instead. Activating/deactivating strict mode has no influence on the result.

Reproduce code:
---------------
$content = file_get_contents($src_path);
	
$encodings = array('UTF-16', 'UTF-16LE', 'UTF-16BE', 'UTF-8', 'UNICODE', 'ISO-8859-1');

$enc = mb_detect_encoding($content, $encodings);
print "encoding: $enc\n";
	
print 'checked: ' . intval(mb_check_encoding($content, 'UTF-16LE'));

Expected result:
----------------
encoding: UTF-16LE
checked: 1

Actual result:
--------------
encoding: ISO-8859-1
checked: 1

 [2008-10-26 23:01 UTC] jani@php.net
Assigned to the mbstring maintainer.
 [2008-11-08 02:20 UTC] hirokawa@php.net
mb_detect_encoding does not support detection of UTF-16/UTF-16BE.
Because UTF-16 is not a byte-stream encoding like UTF-8, it cannot be detected the same way as the byte-stream encodings.

The file encoded in UTF-16 can be detected easily using BOM,
for example:

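// Compare the first two bytes with the UTF-16 byte order mark (BOM).
if ($content[0]==chr(0xff) && $content[1]==chr(0xfe)) {
  echo 'UTF-16';   // BOM FF FE: little-endian byte order
} else if ($content[0]==chr(0xfe) && $content[1]==chr(0xff)) {
  echo 'UTF-16BE'; // BOM FE FF: big-endian byte order
}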
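
The BOM check above can be sketched as a reusable function that also handles short or empty input. This is only an illustration: detect_bom() is a hypothetical helper name, not part of mbstring, and the UTF-8 and UTF-32 entries go beyond what the comment mentions. The UTF-32LE mark is tested before the UTF-16LE one because it begins with the same two bytes.

```php
<?php
// Hypothetical helper (not an mbstring function): map a leading byte order
// mark to an encoding name, or return null when no known BOM is present.
function detect_bom(string $content): ?string
{
    // Longer marks come first: the UTF-32LE BOM starts with the UTF-16LE BOM.
    $boms = [
        "\x00\x00\xFE\xFF" => 'UTF-32BE',
        "\xFF\xFE\x00\x00" => 'UTF-32LE',
        "\xEF\xBB\xBF"     => 'UTF-8',
        "\xFE\xFF"         => 'UTF-16BE',
        "\xFF\xFE"         => 'UTF-16LE',
    ];
    foreach ($boms as $bom => $encoding) {
        // strncmp() is binary safe and does not read past the end of $content.
        if (strncmp($content, $bom, strlen($bom)) === 0) {
            return $encoding;
        }
    }
    return null;
}
```

A null result only means that no BOM was found; it says nothing about the actual encoding. A UTF-16LE file whose first character after the BOM is U+0000 would be reported as UTF-32LE by this sketch.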






 [2008-11-10 07:30 UTC] mtrojan at transline dot de
Of course, comparing the beginning of a file with the UTF-16 BOM can be used to detect UTF-16 encoding. But what do you do with UTF-16 encoded files where no BOM is set?
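
For files without a BOM, one common heuristic (not something mbstring provides, and not proposed by anyone in this report) is to count NUL bytes at even and odd offsets: text that is mostly ASCII or Latin-1 has a zero high byte in almost every UTF-16 code unit. A minimal sketch, assuming mostly Western text; guess_utf16_byte_order() and its threshold are hypothetical:

```php
<?php
// Hypothetical heuristic: guess the UTF-16 byte order of BOM-less text from
// the position of NUL bytes. Returns null when the evidence is inconclusive.
function guess_utf16_byte_order(string $content, float $threshold = 0.4): ?string
{
    $length = strlen($content);
    if ($length < 2 || $length % 2 !== 0) {
        return null; // UTF-16 data always has an even byte length
    }
    $pairs = intdiv($length, 2);
    $nulEven = 0; // NUL bytes at offsets 0, 2, 4, ...
    $nulOdd = 0;  // NUL bytes at offsets 1, 3, 5, ...
    for ($i = 0; $i < $length; $i += 2) {
        if ($content[$i] === "\x00") {
            $nulEven++;
        }
        if ($content[$i + 1] === "\x00") {
            $nulOdd++;
        }
    }
    // Little-endian stores the (often zero) high byte second, big-endian first.
    // Assumption: the text itself contains no U+0000 characters.
    if ($nulEven === 0 && $nulOdd / $pairs >= $threshold) {
        return 'UTF-16LE';
    }
    if ($nulOdd === 0 && $nulEven / $pairs >= $threshold) {
        return 'UTF-16BE';
    }
    return null;
}
```

This is a guess, not a detection: text written mainly in CJK or other scripts outside the Latin-1 range contains few NUL bytes and would come back as null.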
 [2012-01-02 04:22 UTC] Apollo880 at gmail dot com
Bug with correct encoding detection.

function detect_enc($str)
{
	$awe = mb_list_encodings();
	unset($awe[0], $awe[1], $awe[2]);
	foreach ($awe as $enctype)
	{
		if (mb_check_encoding($str, $enctype) === true) return $enctype;
	}
	return false;
}

echo detect_enc('String_encoded_to_Windows-1251'); // Return 'byte2be'. It's a fail.
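
One likely reason the loop above returns an unexpected name is that mb_check_encoding() only tests validity, and some encodings accept almost any byte sequence (ISO-8859-1 accepts every byte, for example), so the first permissive entry in mb_list_encodings() wins. A possible workaround, sketched under the assumption that the caller knows which encodings are plausible, is to test a short candidate list ordered from strictest to most permissive. first_valid_encoding() is a hypothetical name:

```php
<?php
// Hypothetical helper: return the first candidate encoding in which $str is
// valid, or null if none matches. Candidates should be ordered from the
// strictest encoding to the most permissive one.
function first_valid_encoding(string $str, array $candidates): ?string
{
    foreach ($candidates as $encoding) {
        if (mb_check_encoding($str, $encoding)) {
            return $encoding;
        }
    }
    return null;
}

// The bytes of the Russian word "Привет" in Windows-1251. They do not form
// valid UTF-8, so the UTF-8 candidate is rejected first.
$cp1251 = "\xCF\xF0\xE8\xE2\xE5\xF2";
$detected = first_valid_encoding($cp1251, ['UTF-8', 'Windows-1251']);
```

The result is still only as good as the candidate list: any single-byte encoding placed ahead of the correct one would match first.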
 [2014-04-04 14:32 UTC] soapergem at gmail dot com
I came here to report essentially this same bug. In fact I think this bug is directly related to bugs 51563, 64667, 63433, and even 38138.

I have a UTF-16LE encoded CSV file and fgetcsv() was failing on it. So I got to learn all about different character encodings today!

I read on another bug report from one of the PHP devs that the purpose of mb_detect_encoding() is to "detect which multibyte encoding is in use." It's failing at that right now. When I run mb_detect_encoding() on a UTF-16 encoded string, either it says ASCII (if it is not the first line of the file), or it just returns FALSE (if it is the first line, which includes the BOM). On the other hand, if I run mb_check_encoding($str, 'UTF-16') then it seems I get TRUE for all except the first line.

I'm using PHP 5.5.10 by the way.
 [2015-12-11 17:40 UTC] nathanael at gnat dot ca
So there seems to be some regression here between 5.5 and 5.6. I have a unit test for a project. It takes a UTF-16 encoded file (with BOM), copies it to a tmp dir, then detects the encoding and calls mb_convert_encoding($fileContent,'UTF-8') on the file. On PHP 5.5 the file is converted to UTF-8 properly. On 5.6 (on Linux and Windows) and 7 (Linux) it fails. The BOM becomes ?? and then the file is detected as ASCII.

Using

if ($content[0]==chr(0xff) && $content[1]==chr(0xfe)) {
  echo 'UTF-16';
}

does detect it as UTF-16, but I'd like to be able to detect the files that are multi-byte and convert them to UTF-8.
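
A sketch of how the BOM check and the conversion could be combined. The mb_convert_encoding() call quoted above passes no source encoding, in which case mbstring falls back to the internal encoding; here the source encoding is taken from the BOM and passed explicitly. utf16_with_bom_to_utf8() is a hypothetical name, and the function assumes the file really does start with a UTF-16 BOM:

```php
<?php
// Hypothetical helper: convert UTF-16 data that starts with a BOM to UTF-8.
// Returns null when the data does not begin with a UTF-16 BOM.
function utf16_with_bom_to_utf8(string $content): ?string
{
    $bom = substr($content, 0, 2);
    if ($bom === "\xFF\xFE") {
        $source = 'UTF-16LE';
    } elseif ($bom === "\xFE\xFF") {
        $source = 'UTF-16BE';
    } else {
        return null;
    }
    // Strip the two BOM bytes, then convert with an explicit source encoding.
    return mb_convert_encoding(substr($content, 2), 'UTF-8', $source);
}
```

Stripping the BOM before converting keeps U+FEFF out of the UTF-8 output.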
 [2016-07-31 13:56 UTC] cmb@php.net
-Summary: mb_detect_encoding and mb_check_encoding results are dissonant
+Summary: mb_detect_encoding should support UTF-16
-Type: Bug
+Type: Feature/Change Request
 [2016-07-31 13:56 UTC] cmb@php.net
> mb_detect_encoding does not seem to recognize UTF-16 encoded
> files properly.

That is expected behavior, which is already documented[1]:

| For UTF-16, UTF-32, UCS2 and UCS4, encoding detection will fail
| always.

I'm therefore changing this to a feature request.

> The file encoded in UTF-16 can be detected easily using BOM,

Albeit not reliably, because 0xFE and 0xFF are valid ISO-8859-*
characters, for instance. Furthermore, a BOM is optional for
UTF-16.
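
The ambiguity can be illustrated with a short check. The sample bytes are made up for this sketch: they start with FF FE followed by "hi" in UTF-16LE, and mb_check_encoding() should accept them under both encodings, since every byte value is a valid ISO-8859-1 character.

```php
<?php
// Made-up sample data: a UTF-16LE BOM followed by "hi" in UTF-16LE.
$bytes = "\xFF\xFE" . "h\x00i\x00";

// The same bytes pass validation as UTF-16LE and as ISO-8859-1, so a leading
// FF FE pair alone does not prove that the data is UTF-16.
$validAsUtf16le = mb_check_encoding($bytes, 'UTF-16LE');
$validAsLatin1  = mb_check_encoding($bytes, 'ISO-8859-1');

var_dump($validAsUtf16le, $validAsLatin1);
```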

@nathanael at gnat dot ca: that would be a different issue, so
please open a separate ticket.

[1] <http://php.net/manual/en/function.mb-detect-order.php>
 
PHP Copyright © 2001-2024 The PHP Group
All rights reserved.
Last updated: Mon Nov 25 10:01:32 2024 UTC