Bug #47076 binary representation of unicode
Submitted: 2009-01-12 12:15 UTC Modified: 2009-01-17 18:09 UTC
From: lunter at interia dot pl Assigned:
Status: Not a bug Package: Unicode Engine related
PHP Version: 6CVS-2009-01-12 (CVS) OS: all
Private report: No CVE-ID: None

 [2009-01-12 12:15 UTC] lunter at interia dot pl
There is no way to convert between binary and string without charset translation, either to view the binary representation of a Unicode string or to build a Unicode string from binary data that contains valid Unicode sequences.

Note that unicode_encode/unicode_decode use charset translation; see Reproduce code.

Example 1:

You have (binary)$b. It consists of two bytes: 11001110 10110010
Its length as binary data is two.
It is also a valid single UTF-8 character, char(946) (Greek small letter beta).
How can $b be converted into a one-character UTF-8 string?
When we try $u=(string)$b, it gives a two-character UTF-8 string.
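
A minimal sketch of Example 1, assuming a released PHP (7.2 or later) with the mbstring extension rather than the PHP 6 CVS build this report targets. In released PHP a string is a plain byte array, so the two bytes already are the UTF-8 text, and only the counting function decides whether they are seen as two bytes or as one character. The variable names are illustrative.

```php
<?php
// Two raw bytes: 11001110 10110010 (0xCE 0xB2).
$b = chr(0b11001110) . chr(0b10110010);

// Counted as bytes, the length is 2.
$byteLength = strlen($b);

// Counted as UTF-8 characters, the same bytes are one character.
$charLength = mb_strlen($b, 'UTF-8');

// The bytes form a valid UTF-8 sequence for code point 946 (U+03B2).
$isValidUtf8 = mb_check_encoding($b, 'UTF-8');
$codePoint   = mb_ord($b, 'UTF-8');

printf(
    "bytes=%d chars=%d valid=%s codepoint=%d\n",
    $byteLength,
    $charLength,
    $isValidUtf8 ? 'yes' : 'no',
    $codePoint
);
```

The charset is still named explicitly ('UTF-8') in every character-level call, so this does not sidestep charset information; it only avoids re-encoding the bytes.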

Example 2:

You have (string)$u, a one-character UTF-8 string. It consists of chr(946) (Greek
small letter beta).
Now you want to see its two-byte binary representation (11001110 10110010).
There is no way to convert it without charset translation...
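
A sketch of what Example 2 asks for, again assuming a released PHP (7.2 or later) with mbstring, where the stored bytes of a string can be inspected directly. `bytes_to_bits` is a hypothetical helper written for this illustration, not a PHP built-in.

```php
<?php
// Hypothetical helper (not a PHP built-in): render each byte as 8 bits.
function bytes_to_bits(string $bytes): string
{
    $bits = [];
    foreach (str_split($bytes) as $byte) {
        $bits[] = str_pad(decbin(ord($byte)), 8, '0', STR_PAD_LEFT);
    }
    return implode(' ', $bits);
}

// Build the one-character string from its code point, stored as UTF-8.
$u = mb_chr(946, 'UTF-8');

echo mb_strlen($u, 'UTF-8'), "\n"; // number of characters
echo strlen($u), "\n";             // number of bytes
echo bytes_to_bits($u), "\n";      // bit pattern of the stored bytes
```

The bit pattern shown depends on the encoding passed to `mb_chr`; asking for a different encoding would store, and therefore print, different bytes.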

Reproduce code:




Expected result:
1 (unicode 1 char)
2 (binary 2 bytes) [11001110 10110010]

Actual result:

There is no way to convert binary<->string without charset translation.
The string reports length = 1, but its binary form is 2 bytes.


 [2009-01-12 12:25 UTC] lunter at interia dot pl
Two new functions needed:

(binary) uni2bin( (string) unicode data )
(string) bin2uni( (binary) binary data )

The difference from unicode_(en|de)code is that these convert WITHOUT charset translation.
 [2009-01-12 12:40 UTC] lunter at interia dot pl
Example 3:

 print('You have to calculate base64 of unicode chr(946)<br>');
 print('regular base64 of unicode chr(946) (\u03b2) is: zrI='.'<br><br>');



// print(base64_encode(uni2bin($unicode))); // zrI=
 [2009-01-12 12:45 UTC] lunter at interia dot pl
Example 4:

 print('You have to calculate sha1 of unicode chr(946)<br>');
 print('regular sha1 of unicode chr(946) (\u03b2) is: 25b9b2c8a851851c7e0f1cff29a93a6aa6895f34'.'<br><br>');



// print(sha1(uni2bin($unicode))); // 25b9b2c8a851851c7e0f1cff29a93a6aa6895f34
 [2009-01-12 12:50 UTC] lunter at interia dot pl
Please keep in mind that Unicode chr(946) occupies two bytes in binary: [11001110 10110010].
 [2009-01-12 12:54 UTC] lunter at interia dot pl
All examples above are in UTF-8.
With UTF-16, the sha1 and base64 values will not be the same.
 [2009-01-12 13:25 UTC] lunter at interia dot pl

Valid base64 / sha1 values of UTF-8 char(946):

 print('UTF-8 char(946):<br>');
 print('base64: '.base64_encode(chr(206).chr(178)).'<br>');
 print('sha1: '.sha1(chr(206).chr(178)).'<br>');
 [2009-01-12 13:39 UTC] lunter at interia dot pl

// ---

Valid base64 / sha1 values of UTF-16LE char(946):

 print('UTF-16LE char(946):<br>');
 print('base64: '.base64_encode(chr(178).chr(3)).'<br>');
 print('sha1: '.sha1(chr(178).chr(3)).'<br>');

// ---

Valid base64 / sha1 values of UTF-16BE char(946):

 print('UTF-16BE char(946):<br>');
 print('base64: '.base64_encode(chr(3).chr(178)).'<br>');
 print('sha1: '.sha1(chr(3).chr(178)).'<br>');
 [2009-01-12 13:45 UTC] lunter at interia dot pl
There is no way to calculate base64 or sha1 from a Unicode string (unicode.script_encoding = UTF-8 [or UTF-16LE, UTF-16BE]) starting from $unicode=chr(946),

because we have no method to convert multi-byte characters to their binary representation.


When the script encoding is UTF-8, chr(946):
base64($unicode) is zrI=
sha1($unicode) is 25b9b2c8a851851c7e0f1cff29a93a6aa6895f34

When the script encoding is UTF-16LE, chr(946):
base64($unicode) is sgM=
sha1($unicode) is e84c936ce61a692fcc5a402b3b9b733592ba0b67

When the script encoding is UTF-16BE, chr(946):
base64($unicode) is A7I=
sha1($unicode) is 2403f70ce33aeec4e21a519ffebb2864afc89fda
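
The per-encoding values above can be recomputed with a short sketch, assuming a released PHP (7.2 or later) with mbstring. `mb_convert_encoding` is swapped in here to pick the byte representation explicitly; it is not the `uni2bin` function proposed in this report, and the variable names are illustrative.

```php
<?php
// Code point 946 as a UTF-8 string (the in-memory form in released PHP).
$unicode = mb_chr(946, 'UTF-8');

$results = [];
foreach (['UTF-8', 'UTF-16LE', 'UTF-16BE'] as $encoding) {
    // Choose the byte representation first, then encode or hash those bytes.
    $bytes = mb_convert_encoding($unicode, $encoding, 'UTF-8');
    $results[$encoding] = [
        'hex'    => bin2hex($bytes),
        'base64' => base64_encode($bytes),
        'sha1'   => sha1($bytes),
    ];
}

foreach ($results as $encoding => $r) {
    printf(
        "%-8s hex=%s base64=%s sha1=%s\n",
        $encoding,
        $r['hex'],
        $r['base64'],
        $r['sha1']
    );
}
```

Each digest is a function of the bytes, not of the abstract character, which is why the encoding has to be chosen before hashing.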
 [2009-01-12 14:57 UTC] lunter at interia dot pl
Note that:

chr(206).chr(178) is binary representation of UTF-8 char no. 946
chr(178).chr(3) is binary representation of UTF-16LE char no. 946
chr(3).chr(178) is binary representation of UTF-16BE char no. 946
 [2009-01-17 18:09 UTC]
Converting unicode<->binary will always need charset information. If you need UTF-16 data you can ask unicode_encode/unicode_decode to use UTF-16.
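
A small sketch of why the charset has to be stated, assuming a released PHP (7.2 or later) with mbstring. `mb_convert_encoding` stands in for the unicode_encode/unicode_decode functions named above, on the assumption that the reader is not running the PHP 6 CVS build; the variable names are illustrative.

```php
<?php
$bytes = "\xCE\xB2";

// Read as UTF-8, the two bytes are one character (code point 946).
$asUtf8 = mb_strlen($bytes, 'UTF-8');

// Read as ISO-8859-1, the same two bytes are two characters.
$asLatin1 = mb_strlen($bytes, 'ISO-8859-1');

// Producing UTF-16 data means naming both the source and target charset.
$utf16be = mb_convert_encoding($bytes, 'UTF-16BE', 'UTF-8');
$utf16le = mb_convert_encoding($bytes, 'UTF-16LE', 'UTF-8');

// Decoding with the matching charset restores the original bytes.
$roundTrip = mb_convert_encoding($utf16le, 'UTF-8', 'UTF-16LE');

printf(
    "utf8=%d latin1=%d be=%s le=%s roundtrip=%s\n",
    $asUtf8,
    $asLatin1,
    bin2hex($utf16be),
    bin2hex($utf16le),
    $roundTrip === $bytes ? 'same' : 'different'
);
```

The byte sequence alone does not say which reading is intended, so any conversion between text and bytes carries a charset, stated or implied.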