Bug #68119 mb_detect_encoding can not detect encoding rightly
Submitted: 2014-09-30 00:55 UTC Modified: 2014-09-30 02:41 UTC
From: zf at ancientrock dot org Assigned:
Status: Not a bug Package: mbstring related
PHP Version: master-Git-2014-09-30 (Git) OS: Linux
Private report: No CVE-ID: None
 [2014-09-30 00:55 UTC] zf at ancientrock dot org
Description:
------------
mb_detect_encoding cannot detect the encoding correctly.

$str1 = hex2bin('e58da0e6a5bc');
$str2 = hex2bin('d5bcc2a5');

We want to detect the encoding of each string to see what it is:

CP936? EUC-CN? UTF-8?

Test script:
---------------
$str1 = hex2bin('e58da0e6a5bc');
$str2 = hex2bin('d5bcc2a5');
var_dump(mb_detect_encoding($str1, 'CP936', true));
var_dump(mb_detect_encoding($str1, 'EUC-CN', true));
var_dump(mb_detect_encoding($str1, 'UTF-8', true));
var_dump(mb_detect_encoding($str2, 'CP936', true));
var_dump(mb_detect_encoding($str2, 'EUC-CN', true));
var_dump(mb_detect_encoding($str2, 'UTF-8', true));

Expected result:
----------------
$str1 is UTF-8 encoded,

$str2 is CP936 encoded.

Actual result:
--------------
Every call reports a successful detection.

History

 [2014-09-30 02:07 UTC] requinix@php.net
-Status: Open +Status: Not a bug
 [2014-09-30 02:07 UTC] requinix@php.net
I get E58DA0E6A5BC as invalid EUC-CN. http://3v4l.org/rEcG5

But the rest is correct. Just because it's not the string you expect doesn't mean it's invalid.
E58DA0E6A5BC as
  CP936:  鍗玳ゼ
  EUC-CN: invalid
  UTF-8:  占楼

D5BCC2A5 as
  CP936:  占楼
  EUC-CN: 媼促
  UTF-8:  ռ¥
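
The table above can be reproduced with a short script. This is only a sketch: it assumes the mbstring extension is loaded, and the helper name interpret_bytes is made up for illustration. Results for borderline sequences (such as E58DA0E6A5BC under EUC-CN) may differ between PHP versions.

```php
<?php
// Hypothetical helper (not part of mbstring): interpret the same bytes
// under several candidate encodings.
function interpret_bytes(string $hex, array $encodings): array
{
    $bytes = hex2bin($hex);
    $result = [];
    foreach ($encodings as $enc) {
        // mb_check_encoding() only reports whether the bytes form a valid
        // sequence in $enc. It says nothing about whether the resulting
        // text is the one the author intended.
        $result[$enc] = mb_check_encoding($bytes, $enc)
            ? mb_convert_encoding($bytes, 'UTF-8', $enc)
            : null; // null marks an invalid byte sequence
    }
    return $result;
}

foreach (['e58da0e6a5bc', 'd5bcc2a5'] as $hex) {
    echo strtoupper($hex), "\n";
    foreach (interpret_bytes($hex, ['CP936', 'EUC-CN', 'UTF-8']) as $enc => $text) {
        printf("  %-7s %s\n", $enc . ':', $text ?? 'invalid');
    }
}
```

A byte sequence that is valid in several encodings yields a non-null entry for each of them, which is why a validity check alone cannot pick "the" encoding.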
 [2014-09-30 02:11 UTC] requinix@php.net
>EUC-CN: 媼促
Probably not right, actually, taking a second look at my source. But D5BCC2A5 is still a valid byte sequence for that encoding.
 [2014-09-30 02:23 UTC] zf at ancientrock dot org
It really is a bug. One word can only have one encoding; it cannot be in all encodings at the same time. Otherwise we cannot write correct code.
 [2014-09-30 02:41 UTC] requinix@php.net
I tried responding to your email but your mail server is rejecting my reply.

This is not a bug. PHP strings do not have encodings - they are just plain bytes. That's why one PHP string can "be" multiple character sequences at once: it depends on which encoding you (or your browser) use to interpret those bytes.
To your issue, you do not have to detect character encodings yourself because you can tell the browser which encoding to use and it will do so, both when interpreting your HTML and when POSTing form data.
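
As a rough sketch of what declaring the encoding can look like: the script below sends a UTF-8 Content-Type header, repeats the charset in the HTML and on the form, and validates submitted data instead of guessing its encoding. The field name "comment" and the helper read_utf8_field are hypothetical, chosen only for this illustration.

```php
<?php
// Hypothetical helper: return the submitted field only if it is valid UTF-8.
function read_utf8_field(array $post, string $name): ?string
{
    $value = $post[$name] ?? null;
    if (!is_string($value) || !mb_check_encoding($value, 'UTF-8')) {
        return null; // missing, not a string, or not valid UTF-8
    }
    return $value;
}

// Tell the browser which encoding the page uses.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

$comment = read_utf8_field($_POST, 'comment');
$echoed = $comment === null
    ? ''
    : '<p>' . htmlspecialchars($comment, ENT_QUOTES, 'UTF-8') . '</p>';

// accept-charset tells the browser which encoding to use when submitting.
echo <<<HTML
<!DOCTYPE html>
<html>
<head><meta charset="UTF-8"><title>Encoding demo</title></head>
<body>
<form method="post" accept-charset="UTF-8">
  <input type="text" name="comment">
  <button type="submit">Send</button>
</form>
{$echoed}
</body>
</html>

HTML;
```

Rejecting input that is not valid UTF-8 is one possible policy; converting from a known legacy encoding with mb_convert_encoding would be another.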

Character encodings are not a trivial issue, but this is not the place to explain what you need to do. However, there are many places on the internet that do explain it, so you should search around or ask on an online forum or other support channel.
http://php.net/support.php
 
PHP Copyright © 2001-2024 The PHP Group
All rights reserved.
Last updated: Fri Mar 29 12:01:27 2024 UTC