[2005-10-07 11:47 UTC] narzeczony at zabuchy dot net
Description:
------------
When converting from UTF-16 (to ISO-8859-1, for example), the BOM (the first 2 bytes of the UTF-16 text) should be removed, but mb_convert_encoding tries to convert it instead. The problem is similar to bug #22108, but maybe this one can be fixed.

Reproduce code:
---------------
$iso_8859_1 = 'Nexor';
$utf16LE = mb_convert_encoding($iso_8859_1,'UTF-16LE','ISO-8859-1');
$utf16BE = mb_convert_encoding($iso_8859_1,'UTF-16BE','ISO-8859-1');

// let's convert both to UTF-16
// the only difference is the 2-byte BOM field added at the beginning

// \xFF\xFE for little endian
$utf16LE = "\xFF\xFE".$utf16LE;
foreach (str_split($utf16LE) as $l) {echo ord($l).' ';}
echo ' --> ';
$utf16LE2iso = mb_convert_encoding($utf16LE,'ISO-8859-1','UTF-16');
var_dump($utf16LE2iso);

echo '<br/>';

// \xFE\xFF for big endian
$utf16BE = "\xFE\xFF".$utf16BE;
foreach (str_split($utf16BE) as $l) {echo ord($l).' ';}
echo ' --> ';
$utf16BE2iso = mb_convert_encoding($utf16BE,'ISO-8859-1','UTF-16');
var_dump($utf16BE2iso);

Expected result:
----------------
255 254 78 0 101 0 120 0 111 0 114 0 --> string(5) "Nexor"
254 255 0 78 0 101 0 120 0 111 0 114 --> string(5) "Nexor"

Actual result:
--------------
255 254 78 0 101 0 120 0 111 0 114 0 --> string(6) "??exor"
254 255 0 78 0 101 0 120 0 111 0 114 --> string(6) "?Nexor"
There are two problems when mb_convert_encoding is converting from UTF-16:

1) It includes the (transcoded) BOM in the result, rather than stripping it.
2) If the source UTF-16 string was little endian, then the second character of the conversion will be wrong; it is converted as if the character code had 0xFF00 ORed into it.

Problem 1 occurs with any UTF-16 variant (though it is arguably correct behavior for UTF-16LE and UTF-16BE). Problem 2 only occurs when converting from UTF-16.

This PHP program demonstrates both problems:

// print each byte of $s as two hex digits, then var_dump the string
function dump($s) {
    for ($i = 0; $i < strlen($s); ++$i) {
        echo substr(dechex(256+ord(substr($s, $i, 1))), 1, 2), ' ';
    }
    var_dump($s);
}

$utf16le = "\xFF\xFE\x41\x00\x42\x00\x43\x00";
$utf16be = "\xFE\xFF\x00\x41\x00\x42\x00\x43";

// these strings are both valid UTF-16, the BOM at the start indicates
// the endianness. We don't expect the BOM to be included in a conversion

echo "The UTF-16LE and UTF-16BE sequences:\n";
dump($utf16le);
dump($utf16be);
echo "\n";

$encodings = array("ascii", "iso-8859-1", "utf-8", "utf-16", "utf-16le", "utf-16be");

foreach ($encodings as $enc) {
    echo "Converting to $enc:\n";
    dump(mb_convert_encoding($utf16le, $enc, "utf-16"));
    dump(mb_convert_encoding($utf16be, $enc, "utf-16"));
    echo "\n";
}

We're also able to reproduce this, with a much smaller test case:

Reproduce code:
---------------
mb_convert_encoding("\xfe\xff\x22\x1e", 'UTF-8', 'UTF-16');

Expected result:
----------------
\xe2\x88\x9e

Actual result:
--------------
\xef\xbb\xbf\xe2\x88\x9e

Alternatively:

Reproduce code:
---------------
bin2hex(mb_convert_encoding("\xfe\xff\x22\x1e", 'UTF-8', 'UTF-16'));

Expected result:
----------------
e2889e

Actual result:
--------------
efbbbfe2889e
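
A possible workaround on affected versions is to inspect the BOM in user code, strip it, and pass the explicit byte order (UTF-16LE or UTF-16BE) to mb_convert_encoding, so that the generic 'UTF-16' decoder is never involved. The sketch below is only an illustration: the helper name utf16_to is hypothetical and not part of mbstring, and the big-endian fallback for input without a BOM is an assumption that follows the recommendation in RFC 2781.

```php
<?php
// Hypothetical helper, not part of mbstring: converts UTF-16 input that
// may start with a BOM, without passing the BOM to the converter.
function utf16_to($s, $to)
{
    $bom = substr($s, 0, 2);

    if ($bom === "\xFF\xFE") {
        // Little-endian BOM: drop it and name the byte order explicitly.
        return mb_convert_encoding((string) substr($s, 2), $to, 'UTF-16LE');
    }
    if ($bom === "\xFE\xFF") {
        // Big-endian BOM: drop it and name the byte order explicitly.
        return mb_convert_encoding((string) substr($s, 2), $to, 'UTF-16BE');
    }

    // Assumption: input without a BOM is treated as big endian (RFC 2781).
    return mb_convert_encoding($s, $to, 'UTF-16BE');
}

// The short test case from this report, routed through the helper.
echo bin2hex(utf16_to("\xfe\xff\x22\x1e", 'UTF-8')), "\n";
```

With the BOM removed before conversion, the short test case from this report should yield only the three UTF-8 bytes of U+221E, and the 'Nexor' strings should convert back to five characters.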