PHP :: Bug #34776 :: mb_convert_encoding() - wrong convertion from UTF-16 (problem with BOM)

Submitted:	2005-10-07 11:47 UTC
Modified:	2005-10-15 01:00 UTC

Votes:	13
Avg. Score:	4.5 ± 0.7
Reproduced:	11 of 11 (100.0%)
Same Version:	3 (27.3%)
Same OS:	9 (81.8%)

From:	narzeczony at zabuchy dot net
Assigned:
Status:	No Feedback
Package:	mbstring related
PHP Version:	5.0.5
OS:	Linux, Windows
Private report:
CVE-ID:	None


[2005-10-07 11:47 UTC] narzeczony at zabuchy dot net

Description:
------------
When converting from UTF-16 (to ISO-8859-1, for example), the BOM (the first 2 bytes of the UTF-16 text) should be removed, but mb_convert_encoding() tries to convert it as if it were text.
The problem is similar to bug #22108, but maybe this one can be fixed.

Reproduce code:
---------------
$iso_8859_1 = 'Nexor';
$utf16LE = mb_convert_encoding($iso_8859_1,'UTF-16LE','ISO-8859-1');
$utf16BE = mb_convert_encoding($iso_8859_1,'UTF-16BE','ISO-8859-1');

// Let's convert both to UTF-16.
// The only difference is a 2-byte BOM prepended at the beginning:
// \xFF\xFE for little endian
$utf16LE = "\xFF\xFE".$utf16LE;
foreach (str_split($utf16LE) as $l) {echo ord($l).' ';}
echo ' --> ';
$utf16LE2iso = mb_convert_encoding($utf16LE,'ISO-8859-1','UTF-16');
var_dump($utf16LE2iso);

echo '<br/>';

// \xFE\xFF for big endian
$utf16BE = "\xFE\xFF".$utf16BE;
foreach (str_split($utf16BE) as $l) {echo ord($l).' ';}
echo ' --> ';
$utf16BE2iso = mb_convert_encoding($utf16BE,'ISO-8859-1','UTF-16');
var_dump($utf16BE2iso);


Expected result:
----------------
255 254 78 0 101 0 120 0 111 0 114 0 --> string(5) "Nexor"
254 255 0 78 0 101 0 120 0 111 0 114 --> string(5) "Nexor"


Actual result:
--------------
255 254 78 0 101 0 120 0 111 0 114 0 --> string(6) "??exor"
254 255 0 78 0 101 0 120 0 111 0 114 --> string(6) "?Nexor"
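
One possible workaround for the behaviour described above is to inspect the BOM by hand, strip it, and pass an explicit byte order to mb_convert_encoding(). This is only a minimal sketch: the helper name utf16_to() is hypothetical (it is not part of mbstring), and treating BOM-less input as big endian is an assumption.

```php
<?php
// Hypothetical helper (not part of mbstring): convert UTF-16 input to $to
// without carrying the BOM over into the result.
function utf16_to($s, $to)
{
    $bom = substr($s, 0, 2);
    if ($bom === "\xFF\xFE") {
        // Little-endian BOM: drop it and name the byte order explicitly.
        $from = 'UTF-16LE';
        $s = substr($s, 2);
    } elseif ($bom === "\xFE\xFF") {
        // Big-endian BOM: drop it and name the byte order explicitly.
        $from = 'UTF-16BE';
        $s = substr($s, 2);
    } else {
        // Assumption: input without a BOM is treated as big endian.
        $from = 'UTF-16BE';
    }
    return mb_convert_encoding($s, $to, $from);
}

var_dump(utf16_to("\xFF\xFEN\x00e\x00x\x00o\x00r\x00", 'ISO-8859-1'));
var_dump(utf16_to("\xFE\xFF\x00N\x00e\x00x\x00o\x00r", 'ISO-8859-1'));
```

With the two strings from the reproduce code, both calls should yield the 5-character string "Nexor".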


[2005-10-07 11:52 UTC] narzeczony at zabuchy dot net

There is also a small typo in the documentation, but I don't want to open another bug for it.
On http://ie.php.net/mbstring this section appears twice:

Name in the IANA character set registry: UTF-16BE
Underlying character set: Unicode
Description: See above.
Additional note: In contrast to UTF-16, strings are always assumed to be in big endian form. 

One of them should be about UTF-16BE and the other about UTF-16LE.

[2005-10-07 11:57 UTC] derick@php.net

I think this is correct as you are not supposed to supply a BOM if you specify which endianness your UTF16 stream is in.

[2005-10-07 12:33 UTC] narzeczony at zabuchy dot net

I'm not specifying which endianness mb_convert_encoding should use when converting to ISO. Look:
$utf16LE2iso = mb_convert_encoding($utf16LE,'ISO-8859-1','UTF-16');

I'm converting from UTF-16 (LE or BE) to ISO-8859-1. It looks like mb_convert_encoding checks the BOM and chooses the right byte order (if you remove the BOM, the string won't be converted properly for one endianness). The only problem is that the BOM is not ignored.

The first two lines, which do specify the endianness:
$utf16LE = mb_convert_encoding($iso_8859_1,'UTF-16LE','ISO-8859-1');
$utf16BE = mb_convert_encoding($iso_8859_1,'UTF-16BE','ISO-8859-1');
are just a convenient way to create the UTF-16 strings; please ignore them.

[2005-10-07 12:43 UTC] derick@php.net

ah, mbstring has a weird parameter order (dest, src) instead of (src, dest)... did you try to use iconv perhaps?

[2005-10-07 16:36 UTC] narzeczony at zabuchy dot net

The same example with iconv instead of mb_convert_encoding works perfectly, but I guess that doesn't close the bug in mb_convert_encoding :).

Another problem exists when converting to 'UTF-16' with mb_convert_encoding: the BOM is not added. Again, iconv works well in this case.
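
If a BOM is wanted on output and mb_convert_encoding() does not add one, it can be prepended by hand. The sketch below assumes little-endian output is acceptable; the helper name to_utf16_with_bom() is hypothetical and not an mbstring function.

```php
<?php
// Hypothetical helper: encode $s (given in encoding $from) as little-endian
// UTF-16 and prepend the matching BOM manually.
function to_utf16_with_bom($s, $from)
{
    return "\xFF\xFE" . mb_convert_encoding($s, 'UTF-16LE', $from);
}

// Show the resulting bytes in hex so the BOM is visible.
echo bin2hex(to_utf16_with_bom('Nexor', 'ISO-8859-1')), "\n";
```

Naming the byte order explicitly ('UTF-16LE') keeps the BOM and the payload consistent regardless of what plain 'UTF-16' would produce.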

[2005-10-07 21:58 UTC] sniper@php.net

Please try using this CVS snapshot:

  http://snaps.php.net/php5-latest.tar.gz
 
For Windows:
 
  http://snaps.php.net/win32/php5-win32-latest.zip

[2005-10-15 01:00 UTC] php-bugs at lists dot php dot net

No feedback was provided for this bug for over a week, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

[2006-06-23 16:11 UTC] markl at lindenlab dot com

There are two problems when mb_convert_encoding is 
converting from UTF-16:

1) It is including the (transcoded) BOM in the result, 
rather than stripping it

2) If the source UTF-16 string was little endian, then the 
second character of the conversion will be wrong; it is 
converted as if the character code had 0xFF00 or'd into it.

Problem 1 occurs with any UTF-16 variant (though it is 
arguably correct behavior for UTF-16LE and UTF-16BE).  
Problem 2 only occurs when converting from UTF-16.

This PHP program demonstrates this all clearly:



function dump($s)
{
	for ($i = 0; $i < strlen($s); ++$i) {
		echo substr(dechex(256+ord(substr($s, $i, 1))), 1, 2),  ' ';
	}
	var_dump($s);
}

$utf16le = "\xFF\xFE\x41\x00\x42\x00\x43\x00";
$utf16be = "\xFE\xFF\x00\x41\x00\x42\x00\x43";
	// these strings are both valid UTF-16, the BOM at the start indicates
	// the endianness.  We don't expect the BOM to be included in a conversion

echo "The UTF-16LE and UTF-16BE sequences:\n";
dump($utf16le);
dump($utf16be);
echo "\n";

$encodings = array("ascii", "iso-8859-1", "utf-8", "utf-16", "utf-16le", "utf-16be");

foreach ($encodings as $enc) {
	echo "Converting to $enc:\n";
	dump(mb_convert_encoding($utf16le, $enc, "utf-16"));
	dump(mb_convert_encoding($utf16be, $enc, "utf-16"));
	echo "\n";
}

[2008-02-18 17:16 UTC] jdephix at polenord dot com

UTF-16LE and UTF-16BE seem mixed up when using mb_convert_encoding.

I want to read the content of a file in UTF-16BE (starts with \xFE\xFF) and convert it into UTF-8:

$s = file_get_contents($fileUTF16BE);
$s = mb_convert_encoding($s, 'UTF-8', "UTF-16BE");
//some operations on $s
file_put_contents($anotherUTF16BEfile, mb_convert_encoding($s, 'UTF-16BE', "UTF-8"));

The second file is in Little Endian (starts with \xFF\xFE)!!!

I have to specify LE if I want BE.
file_put_contents($anotherUTF16BEfile, mb_convert_encoding($s, 'UTF-16LE', "UTF-8"));

How come it's reversed?

[2008-02-18 17:20 UTC] jdephix at polenord dot com

I forgot to add that I did manage to deal with the UTF-16BE file by reversing everything.

$s = file_get_contents($fileUTF16BE);
$s = mb_convert_encoding($s, 'UTF-8', "UTF-16LE");
//some operations on $s
file_put_contents($anotherUTF16BEfile, mb_convert_encoding($s, 'UTF-16LE', "UTF-8"));

I need to specify "UTF-16LE" in order to be sure I work with "UTF-16BE".
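
When the byte order seems reversed, it can help to look at the raw bytes on both sides: which BOM the input really starts with, and which bytes each encoding name produces for a known character. The following is a rough sketch; utf16_bom_order() is a hypothetical helper, not an mbstring function.

```php
<?php
// Hypothetical helper: report the byte order announced by a leading BOM,
// or null when the string does not start with a UTF-16 BOM.
function utf16_bom_order($bytes)
{
    $bom = substr($bytes, 0, 2);
    if ($bom === "\xFE\xFF") {
        return 'UTF-16BE';
    }
    if ($bom === "\xFF\xFE") {
        return 'UTF-16LE';
    }
    return null;
}

// Print the bytes produced for "A" (U+0041) under each explicit byte order.
foreach (array('UTF-16BE', 'UTF-16LE') as $enc) {
    echo $enc, ': ', bin2hex(mb_convert_encoding('A', $enc, 'UTF-8')), "\n";
}

var_dump(utf16_bom_order("\xFE\xFF\x00\x41"));
```

Big-endian output puts the high byte first (00 41 for "A"), little-endian output puts the low byte first (41 00).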

[2011-04-06 15:20 UTC] me+phpbugs at ryanmccue dot info

We're also able to reproduce this, with a much smaller test case:

Reproduce code:
---------------
mb_convert_encoding("\xfe\xff\x22\x1e", 'UTF-8', 'UTF-16');


Expected result:
----------------
\xe2\x88\x9e


Actual result:
--------------
\xef\xbb\xbf\xe2\x88\x9e

[2011-04-07 04:12 UTC] me+phpbugs at ryanmccue dot info

Alternatively:

Reproduce code:
---------------
bin2hex(mb_convert_encoding("\xfe\xff\x22\x1e", 'UTF-8', 'UTF-16'));


Expected result:
----------------
e2889e


Actual result:
--------------
efbbbfe2889e
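
Where the transcoded BOM does end up at the start of the UTF-8 output, as in the result above, it can be removed after the conversion. This is a minimal sketch; strip_utf8_bom() is a hypothetical helper name, and it operates on the already converted bytes so it does not depend on how mb_convert_encoding() behaves.

```php
<?php
// Hypothetical helper: drop a leading UTF-8 encoded BOM (EF BB BF), if present.
function strip_utf8_bom($s)
{
    if (strncmp($s, "\xEF\xBB\xBF", 3) === 0) {
        return substr($s, 3);
    }
    return $s;
}

// The "actual result" bytes from the report, with the BOM removed.
echo bin2hex(strip_utf8_bom("\xef\xbb\xbf\xe2\x88\x9e")), "\n";
```

Strings that do not start with the BOM are returned unchanged, so the helper is safe to apply to output that is already clean.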
