Bug #47076 binary representation of unicode
Submitted: 2009-01-12 12:15 UTC
Modified: 2009-01-17 18:09 UTC
From: lunter at interia dot pl
Assigned:
Status: Not a bug
Package: Unicode Engine related
PHP Version: 6CVS-2009-01-12 (CVS)
OS: all
Private report: No
CVE-ID: None
 [2009-01-12 12:15 UTC] lunter at interia dot pl
Description:
------------
There is no way to convert binary<->string without charset translation, either to view the binary representation of a unicode string or to generate a unicode string from binary data that consists of valid unicode sequences.

Note that unicode_encode/unicode_decode use charset translation; see Reproduce code.

Example 1:

You have (binary)$b. It consists of two bytes: 11001110 10110010
Its length in binary representation is two.
It is also a valid UTF-8 sequence for a single character, char(946) (greek small letter beta).
How to convert it ($b) into a one-char UTF-8 string?
When we try $u=(string)$b, it gives a two-char UTF-8 string.

Example 2:

You have (string)$u, a one-char UTF-8 string. It consists of chr(946) (greek
small letter beta).
Now you want to see the two-byte binary representation of it (11001110
10110010).
There is no way to convert it without charset translation...

Reproduce code:
---------------
<?
 $s=chr(946);
 print(strlen($s));

 print('<br>');

 $b=unicode_encode($s,'iso-8859-1');

 print(strlen($b));
?>

Expected result:
----------------
1 (unicode 1 char)
2 (binary 2 bytes) [11001110 10110010]

Actual result:
--------------
1
1


There is no way to convert binary<->string without charset translation.
The binary result reports length = 1, but it should be 2 bytes.
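
For comparison, the two-byte representation described above can be inspected in PHP 5.x and later, where a string is a plain byte sequence. This is only a minimal sketch under that assumption; `bytes_to_bits` is a hypothetical helper name, not an existing PHP function, and none of this uses the PHP 6 unicode string type from the report.

```php
<?php
// Assumption: PHP 5.x or later, where a string is a plain sequence of bytes.
$beta = "\xCE\xB2"; // UTF-8 bytes of code point 946 (greek small letter beta)

// Hypothetical helper: render each byte of a string as 8 binary digits.
function bytes_to_bits($bytes)
{
    $bits = array();
    foreach (unpack('C*', $bytes) as $byte) {
        $bits[] = sprintf('%08b', $byte);
    }
    return implode(' ', $bits);
}

echo strlen($beta), "\n";        // byte length: 2
echo bin2hex($beta), "\n";       // ceb2
echo bytes_to_bits($beta), "\n"; // 11001110 10110010
```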

 [2009-01-12 12:25 UTC] lunter at interia dot pl
Two new functions needed:

(binary) uni2bin( (string) unicode data )
(string) bin2uni( (binary) binary data )


The difference from unicode_(en|de)code is that these functions convert WITHOUT using charset translation.
 [2009-01-12 12:40 UTC] lunter at interia dot pl
Example 3:

<?
 print('You have to calculate base64 of unicode chr(946)<br>');
 print('regular base64 of unicode chr(946) (U+03B2, UTF-8 bytes CE B2) is: zrI='.'<br><br>');

 $unicode=chr(946);

 print(base64_encode((binary)$unicode));
 print('<br>');
 print(base64_encode(unicode_encode($unicode,'iso-8859-1')));
 print('<br>');

// print(base64_encode(uni2bin($unicode))); // zrI=
?>
 [2009-01-12 12:45 UTC] lunter at interia dot pl
Example 4:

<?
 print('You have to calculate sha1 of unicode chr(946)<br>');
 print('regular sha1 of unicode chr(946) (U+03B2, UTF-8 bytes CE B2) is: 25b9b2c8a851851c7e0f1cff29a93a6aa6895f34'.'<br><br>');

 $unicode=chr(946);

 print(sha1((binary)$unicode));
 print('<br>');
 print(sha1(unicode_encode($unicode,'iso-8859-1')));
 print('<br>');

// print(sha1(uni2bin($unicode))); // 25b9b2c8a851851c7e0f1cff29a93a6aa6895f34
?>
 [2009-01-12 12:50 UTC] lunter at interia dot pl
Please keep in mind that unicode chr(946) has two bytes in its binary representation [11001110 10110010].
 [2009-01-12 12:54 UTC] lunter at interia dot pl
All examples above are in UTF-8.
With UTF-16, the sha1 and base64 values will not be the same.
 [2009-01-12 13:25 UTC] lunter at interia dot pl
Use old PHP 5.x.

Valid base64 / sha1 values for UTF-8 char(946):

<?
 print('UTF-8 char(946):<br>');
 print('base64: '.base64_encode(chr(206).chr(178)).'<br>');
 print('sha1: '.sha1(chr(206).chr(178)).'<br>');
?>
 [2009-01-12 13:39 UTC] lunter at interia dot pl
Use old PHP 5.x.

// ---

Valid base64 / sha1 values for UTF-16LE char(946):

<?
 print('UTF-16LE char(946):<br>');
 print('base64: '.base64_encode(chr(178).chr(3)).'<br>');
 print('sha1: '.sha1(chr(178).chr(3)).'<br>');
?>

// ---

Valid base64 / sha1 values for UTF-16BE char(946):

<?
 print('UTF-16BE char(946):<br>');
 print('base64: '.base64_encode(chr(3).chr(178)).'<br>');
 print('sha1: '.sha1(chr(3).chr(178)).'<br>');
?>
 [2009-01-12 13:45 UTC] lunter at interia dot pl
There is no way to calculate base64 or sha1 from a unicode string (unicode.script_encoding = UTF-8 [or UTF-16LE, UTF-16BE]) starting from $unicode=chr(946),

because we don't have a method to convert multi-byte character sets to their binary representation.

$unicode=chr(946)

When script encoding is UTF-8, chr(946)
base64($unicode) is zrI=
sha1($unicode) is 25b9b2c8a851851c7e0f1cff29a93a6aa6895f34

When script encoding is UTF-16LE, chr(946)
base64($unicode) is sgM=
sha1($unicode) is e84c936ce61a692fcc5a402b3b9b733592ba0b67

When script encoding is UTF-16BE, chr(946)
base64($unicode) is A7I=
sha1($unicode) is 2403f70ce33aeec4e21a519ffebb2864afc89fda
 [2009-01-12 14:57 UTC] lunter at interia dot pl
Note that:

chr(206).chr(178) is binary representation of UTF-8 char no. 946
chr(178).chr(3) is binary representation of UTF-16LE char no. 946
chr(3).chr(178) is binary representation of UTF-16BE char no. 946
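
The three byte sequences listed above can be derived from the code point itself. The following is a rough sketch for PHP 5.x and later (byte strings only); `encode_code_point` is a hypothetical helper, limited here to code points below 0x800 to keep it short.

```php
<?php
// Hypothetical helper: encode a code point below 0x800 as bytes in the
// named charset. The charset has to be passed in explicitly.
function encode_code_point($cp, $charset)
{
    if ($cp < 0 || $cp >= 0x800) {
        throw new InvalidArgumentException('sketch only handles code points below 0x800');
    }
    switch ($charset) {
        case 'UTF-8':
            if ($cp < 0x80) {
                return chr($cp); // single byte: 0xxxxxxx
            }
            // two bytes: 110xxxxx 10xxxxxx
            return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
        case 'UTF-16LE':
            return pack('v', $cp); // 16-bit, little-endian
        case 'UTF-16BE':
            return pack('n', $cp); // 16-bit, big-endian
    }
    throw new InvalidArgumentException('unsupported charset: ' . $charset);
}

foreach (array('UTF-8', 'UTF-16LE', 'UTF-16BE') as $charset) {
    $bytes = encode_code_point(946, $charset);
    echo $charset, ': ', bin2hex($bytes), ' base64=', base64_encode($bytes),
         ' sha1=', sha1($bytes), "\n";
}
// hex / base64 per charset: ceb2 / zrI=, b203 / sgM=, 03b2 / A7I=
```

The same code point yields different bytes, and therefore different base64 and sha1 values, depending on which charset is named.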
 [2009-01-17 18:09 UTC] johannes@php.net
Converting unicode<->binary will always need charset information. If you need UTF-16 data you can ask unicode_encode/unicode_decode to use UTF-16.
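
As a rough analogue of this point in PHP 5.x and later, assuming the mbstring extension is loaded, the target charset is always named explicitly when converting. This sketch uses `mb_convert_encoding` and `mb_strlen`, not the PHP 6 development functions mentioned above.

```php
<?php
// Assumption: byte-string PHP (5.x or later) with the mbstring extension.
$utf8 = "\xCE\xB2"; // UTF-8 bytes of code point 946

if (function_exists('mb_convert_encoding')) {
    // The source and target charsets are always stated explicitly.
    $utf16le = mb_convert_encoding($utf8, 'UTF-16LE', 'UTF-8');
    $utf16be = mb_convert_encoding($utf8, 'UTF-16BE', 'UTF-8');

    echo bin2hex($utf16le), "\n"; // b203
    echo bin2hex($utf16be), "\n"; // 03b2

    // Character count versus byte count of the same UTF-8 data.
    echo mb_strlen($utf8, 'UTF-8'), ' ', strlen($utf8), "\n"; // 1 2

    // Converting back with the matching charset restores the original bytes.
    var_dump(mb_convert_encoding($utf16le, 'UTF-8', 'UTF-16LE') === $utf8); // bool(true)
} else {
    echo "mbstring extension not available\n";
}
```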
 
Last updated: Thu Dec 26 19:01:30 2024 UTC