Bug #44014 mb_convert_encoding 'destroys' first character (UTF16->UTF8)
Submitted: 2008-02-01 12:08 UTC Modified: 2008-02-24 01:00 UTC
Votes:2
Avg. Score:4.5 ± 0.5
Reproduced:1 of 1 (100.0%)
Same Version:0 (0.0%)
Same OS:0 (0.0%)
From: michael202 at gmx dot de Assigned: hirokawa (profile)
Status: No Feedback Package: mbstring related
PHP Version: 5.2.5 OS: Win XP
Private report: No CVE-ID: None
 [2008-02-01 12:08 UTC] michael202 at gmx dot de
Description:
------------
mb_convert_encoding() 'destroys' the first character when
converting from UTF-16 to UTF-8.

iconv() converts the same input correctly.

Reproduce code:
---------------
<?php
// Little-endian BOM (FF FE) followed by 'Mo' in UTF-16LE
$utf16 = chr(0xFF).chr(0xFE).chr(0x4d).chr(0).chr(0x6f).chr(0); //'Mo'

$utf8 = mb_convert_encoding($utf16, 'UTF-8', 'UTF-16');

echo($utf8 . "\n");     // -> ?++???o

$utf8 = iconv('UTF-16', 'UTF-8', $utf16);

echo($utf8 . "\n");     // -> Mo
?>


Expected result:
----------------
mb:    (BOM8)Mo
iconv: Mo

(BOM8) is a placeholder

Actual result:
--------------
mb:    (BOM8)???o  (copied from cmd shell)
iconv: Mo

(BOM8) is a placeholder



History

 [2008-02-05 05:10 UTC] jani@php.net
Assigned to the mbstring maintainer.
 [2008-02-16 12:17 UTC] hirokawa@php.net
The Unicode BOM is not supported by the encoding conversion
functions in mbstring.

Also, big endian is the default for UTF-16. Please specify 'UTF-16LE'
if you need the little endian format.

Try,

<?php
$utf16 = chr(0).chr(0x4d).chr(0).chr(0x6f); //'Mo'
$utf8 = mb_convert_encoding($utf16, 'UTF-8', 'UTF-16'); 
echo($utf8 . "\n");     // -> Mo
?>

or

<?php
$utf16 = chr(0x4d).chr(0).chr(0x6f).chr(0); //'Mo'
$utf8 = mb_convert_encoding($utf16, 'UTF-8', 'UTF-16LE'); 
echo($utf8 . "\n");     // -> Mo
?>
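
For input that may or may not start with a BOM, the two variants above can be combined. This is only a minimal sketch: utf16_to_utf8() is a hypothetical helper, not an mbstring function. It assumes that stripping the BOM by hand and then naming the byte order explicitly is acceptable, and that input without a BOM is big endian, as stated above.

```php
<?php
// Hypothetical helper (not part of mbstring): look for a UTF-16 BOM,
// strip it, and convert with an explicit byte order.
function utf16_to_utf8($bytes)
{
    $bom = substr($bytes, 0, 2);

    if ($bom === "\xFF\xFE") {
        // Little endian BOM: drop it and convert the rest as UTF-16LE.
        return mb_convert_encoding(substr($bytes, 2), 'UTF-8', 'UTF-16LE');
    }
    if ($bom === "\xFE\xFF") {
        // Big endian BOM: drop it and convert the rest as UTF-16BE.
        return mb_convert_encoding(substr($bytes, 2), 'UTF-8', 'UTF-16BE');
    }

    // No BOM: assume big endian (an assumption taken from the comment above).
    return mb_convert_encoding($bytes, 'UTF-8', 'UTF-16BE');
}

$utf16 = chr(0xFF).chr(0xFE).chr(0x4d).chr(0).chr(0x6f).chr(0); // BOM + 'Mo'
echo(utf16_to_utf8($utf16) . "\n");     // -> Mo
```

Because the byte order is always passed explicitly, the result should not depend on how a given mbstring version treats a leading BOM.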

 [2008-02-24 01:00 UTC] php-bugs at lists dot php dot net
No feedback was provided for this bug for over a week, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
 [2008-03-28 09:44 UTC] d_kelsey at uk dot ibm dot com
My understanding of UTF-16 is that the BOM is mandatory. With mbstring I have found that if I pass a UTF-16 string to mb_convert_encoding, for example to convert it to UTF-8, it treats the BOM as UTF-16 data and converts it.

mbstring also doesn't generate a BOM when converting to UTF-16, so if the BOM is mandatory, as I thought, it isn't generating valid UTF-16 bytes.

I see that mbstring effectively uses UTF-16BE when you specify UTF-16.

If mbstring doesn't support the BOM, then UTF-16 cannot be handled properly. Should this at least be documented, with a recommendation to use UTF-16BE as the encoding so that you are explicit about what is supported?
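
For the output direction, a caller who needs a BOM can add it by hand. The following is a minimal sketch, assuming little endian output is wanted; utf8_to_utf16le_with_bom() is a hypothetical helper, not something mbstring provides.

```php
<?php
// Hypothetical helper (not part of mbstring): encode as UTF-16LE with an
// explicit byte order, then prepend the little endian BOM (FF FE) manually.
function utf8_to_utf16le_with_bom($utf8)
{
    return "\xFF\xFE" . mb_convert_encoding($utf8, 'UTF-16LE', 'UTF-8');
}

$bytes = utf8_to_utf16le_with_bom('Mo');
echo(bin2hex($bytes) . "\n");     // -> fffe4d006f00
```

The resulting bytes match the $utf16 input from the original report, so they can be passed to iconv('UTF-16', 'UTF-8', ...) as shown there.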
 
Last updated: Fri Mar 29 06:01:29 2024 UTC