PHP :: Request #45993 :: mb_detect_encoding should support UTF-16

Submitted: 2008-09-04 11:47 UTC
Modified: 2016-07-31 13:56 UTC

Votes:	18
Avg. Score:	4.3 ± 0.8
Reproduced:	18 of 18 (100.0%)
Same Version:	3 (16.7%)
Same OS:	2 (11.1%)

From: mtrojan at transline dot de
Assigned:
Status: Open
Package: mbstring related
PHP Version: 5.2.6
OS: Windows XP
Private report:
CVE-ID: None

[2008-09-04 11:47 UTC] mtrojan at transline dot de

Description:
------------
mb_detect_encoding does not seem to recognize UTF-16 encoded files properly. Even when mb_check_encoding confirms that a file is valid UTF-16LE, mb_detect_encoding does not detect the same file as UTF-16 and returns ISO-8859-1 instead. Enabling or disabling strict mode has no influence on the result.

Reproduce code:
---------------
$content = file_get_contents($src_path);
	
$encodings = array('UTF-16', 'UTF-16LE', 'UTF-16BE', 'UTF-8', 'UNICODE', 'ISO-8859-1');

$enc = mb_detect_encoding($content, $encodings);
print "encoding: $enc\n";
	
print 'checked: ' . intval(mb_check_encoding($content, 'UTF-16LE'));

Expected result:
----------------
encoding: UTF-16LE
checked: 1

Actual result:
--------------
encoding: ISO-8859-1
checked: 1

[2008-10-26 23:01 UTC] jani@php.net

Assigned to the mbstring maintainer.

[2008-11-08 02:20 UTC] hirokawa@php.net

mb_detect_encoding does not support detection of UTF-16/UTF-16BE. Because UTF-16 is not a byte-stream encoding like UTF-8, it cannot be detected the way byte-stream encodings are.

A file encoded in UTF-16 can be detected easily by its BOM, for example:

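// 0xFF 0xFE is the little-endian BOM, 0xFE 0xFF the big-endian one.
// Both checks assume $content holds at least two bytes.
if ($content[0]==chr(0xff) && $content[1]==chr(0xfe)) {
  echo 'UTF-16';
} else if ($content[0]==chr(0xfe) && $content[1]==chr(0xff)) {
  echo 'UTF-16BE';
}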
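
The same BOM check can be wrapped in a small function. This is only a sketch: detect_bom_encoding() is a hypothetical helper name, not part of mbstring, and the UTF-8 and UTF-32 entries are additions that the comment above does not mention.

```php
<?php
// Hypothetical helper (not part of mbstring): map a leading BOM to an encoding name.
function detect_bom_encoding(string $content): ?string
{
    // UTF-32 is listed before UTF-16 because the UTF-32LE BOM (FF FE 00 00)
    // begins with the UTF-16LE BOM (FF FE).
    $boms = [
        'UTF-32BE' => "\x00\x00\xFE\xFF",
        'UTF-32LE' => "\xFF\xFE\x00\x00",
        'UTF-8'    => "\xEF\xBB\xBF",
        'UTF-16BE' => "\xFE\xFF",
        'UTF-16LE' => "\xFF\xFE",
    ];
    foreach ($boms as $encoding => $bom) {
        // strncmp() is safe on strings shorter than the BOM, including ''.
        if (strncmp($content, $bom, strlen($bom)) === 0) {
            return $encoding;
        }
    }
    return null; // no BOM found; the encoding is still unknown
}
```

A null result does not mean the data is not UTF-16, only that it carries no BOM.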

[2008-11-10 07:30 UTC] mtrojan at transline dot de

Of course, comparing the beginning of a file with the UTF-16 BOM can be used to detect UTF-16 encoding. But what do you do with UTF-16 encoded files where no BOM is set?
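
For files without a BOM, one common workaround is a heuristic based on where NUL bytes fall: in UTF-16 text that is mostly ASCII or Latin-1, every second byte is 0x00. The sketch below assumes exactly that kind of content. guess_utf16_without_bom() and its threshold are hypothetical, not an mbstring feature, and the guess can be wrong for other scripts or for binary data.

```php
<?php
// Hypothetical heuristic (not an mbstring feature): guess the UTF-16 byte order
// from the position of NUL bytes. Only meaningful for mostly ASCII/Latin-1 text.
function guess_utf16_without_bom(string $content, float $threshold = 0.4): ?string
{
    $length = strlen($content);
    if ($length < 2 || $length % 2 !== 0) {
        return null; // UTF-16 always occupies an even number of bytes
    }

    $pairs   = intdiv($length, 2);
    $nulEven = 0; // NUL as first byte of a pair: high byte first, big-endian
    $nulOdd  = 0; // NUL as second byte of a pair: high byte last, little-endian
    for ($i = 0; $i < $length; $i += 2) {
        if ($content[$i] === "\x00") {
            $nulEven++;
        }
        if ($content[$i + 1] === "\x00") {
            $nulOdd++;
        }
    }

    if ($nulOdd / $pairs >= $threshold && $nulOdd > $nulEven) {
        $guess = 'UTF-16LE';
    } elseif ($nulEven / $pairs >= $threshold && $nulEven > $nulOdd) {
        $guess = 'UTF-16BE';
    } else {
        return null; // no clear pattern
    }

    // Confirm that the bytes are at least well-formed in the guessed encoding.
    return mb_check_encoding($content, $guess) ? $guess : null;
}
```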

[2012-01-02 04:22 UTC] Apollo880 at gmail dot com

Bug with correct encoding detection.

function detect_enc($str)
{
	$awe = mb_list_encodings();
	unset($awe[0], $awe[1], $awe[2]);
	foreach ($awe as $enctype)
	{
		if (mb_check_encoding($str, $enctype) === true) return $enctype;
	}
	return false;
}

echo detect_enc('String_encoded_to_Windows-1251'); // Returns 'byte2be', which is wrong.
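
Looping over every entry of mb_list_encodings() tends to stop at a permissive encoding that accepts almost any byte sequence. A possible alternative, sketched here under the assumption that the set of plausible encodings is known in advance, is to test a short list ordered from strict to permissive. first_valid_encoding() is a hypothetical helper name.

```php
<?php
// Hypothetical helper: return the first candidate in which the bytes are well-formed.
function first_valid_encoding(string $str, array $candidates): ?string
{
    foreach ($candidates as $encoding) {
        if (mb_check_encoding($str, $encoding)) {
            return $encoding;
        }
    }
    return null;
}

// Strict encodings first, permissive single-byte encodings last.
$candidates = ['UTF-8', 'Windows-1251'];

// Six bytes that are Cyrillic letters in Windows-1251 but not valid UTF-8.
$result = first_valid_encoding("\xCF\xF0\xE8\xE2\xE5\xF2", $candidates);
```

This only reports that the bytes are well-formed in the returned encoding; it cannot prove that the text was actually written in it.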

[2014-04-04 14:32 UTC] soapergem at gmail dot com

I came here to report essentially this same bug. In fact I think this bug is directly related to bugs 51563, 64667, 63433, and even 38138.

I have a UTF-16LE encoded CSV file and fgetcsv() was failing on it. So I got to learn all about different character encodings today!

I read on another bug report from one of the PHP devs that the purpose of mb_detect_encoding() is to "detect which multibyte encoding is in use." It's failing at that right now. When I run mb_detect_encoding() on a UTF-16 encoded string, either it says ASCII (if it is not the first line of the file), or it just returns FALSE (if it is the first line, which includes the BOM). On the other hand, if I run mb_check_encoding($str, 'UTF-16') then it seems I get TRUE for all except the first line.

I'm using PHP 5.5.10 by the way.
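
For the fgetcsv() case, a workaround that does not rely on detection is to convert the data to UTF-8 before parsing. The following is a minimal sketch assuming the input is known to be UTF-16LE, as in this report. parse_utf16le_csv() is a hypothetical helper name.

```php
<?php
// Hypothetical helper: parse UTF-16LE CSV data by converting it to UTF-8 first.
function parse_utf16le_csv(string $raw): array
{
    // Drop the little-endian BOM (FF FE) if present.
    if (strncmp($raw, "\xFF\xFE", 2) === 0) {
        $raw = substr($raw, 2);
    }
    // Name the source encoding explicitly instead of relying on detection.
    $utf8 = mb_convert_encoding($raw, 'UTF-8', 'UTF-16LE');

    // fgetcsv() reads from a stream, so stage the converted text in memory.
    $handle = fopen('php://memory', 'r+');
    fwrite($handle, $utf8);
    rewind($handle);

    $rows = [];
    while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
        $rows[] = $row;
    }
    fclose($handle);

    return $rows;
}
```

Typical usage would be parse_utf16le_csv(file_get_contents($path)), where $path stands for the CSV file in question.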

[2015-12-11 17:40 UTC] nathanael at gnat dot ca

So there seems to be a regression between 5.5 and 5.6. I have a unit test for a project. It takes a UTF-16 encoded file (with BOM), copies it to a tmp dir, then detects the encoding and calls mb_convert_encoding($fileContent,'UTF-8') on the contents. On PHP 5.5 the file is converted to UTF-8 properly. On 5.6 (on Linux and Windows) and 7 (Linux) it fails: the BOM becomes ?? and the file is then detected as ASCII.

Using

if ($content[0]==chr(0xff) && $content[1]==chr(0xfe)) {
  echo 'UTF-16';
}

does detect it as UTF-16, but I'd like to be able to detect the files that are multi-byte and convert them to UTF-8.
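
When mb_convert_encoding() is called without its third argument, the source encoding defaults to the internal encoding, so UTF-16 bytes may be read as something else. That may be related to the ?? seen above, although this thread does not confirm it. A minimal sketch that uses the BOM to name the source encoding explicitly follows. utf16_with_bom_to_utf8() is a hypothetical helper name.

```php
<?php
// Hypothetical helper: convert UTF-16 text that starts with a BOM to UTF-8.
function utf16_with_bom_to_utf8(string $content): ?string
{
    $bom = substr($content, 0, 2);
    if ($bom === "\xFF\xFE") {
        $from = 'UTF-16LE';
    } elseif ($bom === "\xFE\xFF") {
        $from = 'UTF-16BE';
    } else {
        return null; // no UTF-16 BOM; the caller has to decide what to do
    }
    // Skip the 2-byte BOM and pass the source encoding as the third argument.
    return mb_convert_encoding(substr($content, 2), 'UTF-8', $from);
}
```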

[2016-07-31 13:56 UTC] cmb@php.net

-Summary: mb_detect_encoding and mb_check_encoding results are dissonant
+Summary: mb_detect_encoding should support UTF-16
-Type: Bug
+Type: Feature/Change Request

[2016-07-31 13:56 UTC] cmb@php.net

> mb_detect_encoding does not seem to recognize UTF-16 encoded
> files properly.

That is expected behavior and is already documented[1]:

| For UTF-16, UTF-32, UCS2 and UCS4, encoding detection will fail
| always.

I'm therefore changing this to a feature request.

> The file encoded in UTF-16 can be detected easily using BOM,

Albeit not reliably, because 0xFE and 0xFF are valid ISO-8859-*
characters, for instance. Furthermore, a BOM is optional for
UTF-16.

@nathanael at gnat dot ca: that would be a different issue, so
please open a separate ticket.

[1] <http://php.net/manual/en/function.mb-detect-order.php>
