Doc Bug #74528 mb_check_encoding Documentation Confusing
Submitted: 2017-05-02 10:08 UTC
Modified: 2017-05-02 15:39 UTC
From: james dot antrim at nm dot thm dot de
Assigned: cmb
Status: Not a bug
Package: mbstring related
PHP Version: 5.6.30
OS: Debian 3.16.39-1+deb8u2 (2017-03
Private report: No
CVE-ID: None

 [2017-05-02 10:08 UTC] james dot antrim at nm dot thm dot de
Description:
------------
---
From manual page: http://www.php.net/function.mb-detect-encoding
---

I saved a file that was originally ISO-8859-1 in several different encodings, among them UTF-8 and Japanese Shift JIS. When I call this function with 'UTF-8' as the 'encoding list' parameter, I always get 'UTF-8' as the return value.

I assume this is working as intended, because it effectively solves both detection and conversion in one function. However, I find the documentation of the return value more than confusing: despite the function being called 'detect', it does not tell me the original string's encoding, but the encoding of the converted string if conversion was possible. My suggestion would be:
"The character encoding or FALSE if the encoding cannot be converted to any of the given strings."


Test script:
---------------
Convert any ISO-8859-1 file to any other encoding, then run mb_detect_encoding($file, 'UTF-8', true) on it.

This same documentation problem is evident for mb_check_encoding($file, 'UTF-8') which will always return true under the same conditions.
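
The claim about mb_check_encoding() can be checked with a minimal sketch. The sample strings below are assumptions standing in for the reporter's files, which are not part of this report. The sketch suggests that mb_check_encoding() only validates the bytes against the given encoding: ASCII-only content passes as UTF-8, while an ISO-8859-1 byte above 0x7F does not.

```php
<?php
// Hypothetical sample data; the files from the report are not available.
$asciiOnly = "plain text";   // ASCII letters: same bytes in ISO-8859-1, UTF-8 and Shift JIS
$latin1    = "caf\xE9";      // "café" in ISO-8859-1; a lone 0xE9 is not valid UTF-8
$utf8      = "caf\xC3\xA9";  // "café" in UTF-8

// mb_check_encoding() validates the byte sequence; it does not convert anything.
$asciiIsUtf8  = mb_check_encoding($asciiOnly, 'UTF-8');
$latin1IsUtf8 = mb_check_encoding($latin1, 'UTF-8');
$utf8IsUtf8   = mb_check_encoding($utf8, 'UTF-8');

var_dump($asciiIsUtf8);  // bool(true)
var_dump($latin1IsUtf8); // bool(false)
var_dump($utf8IsUtf8);   // bool(true)
```

If the test files contain only ASCII characters, a result of true for every saved variant would be expected, since the bytes are then identical in all three encodings.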


History

 [2017-05-02 14:33 UTC] cmb@php.net
-Status: Open
+Status: Not a bug
-Package: Documentation problem
+Package: mbstring related
-Assigned To:
+Assigned To: cmb
 [2017-05-02 14:33 UTC] cmb@php.net
mb_detect_encoding() is not supposed to do any conversion, so the
documentation is correct in this regard. However, if you want to
check for 'UTF-8' (and maybe other encodings as well), you have
to use strict detection, see <https://3v4l.org/DFIAs> vs.
<https://3v4l.org/M23D4> and also
<http://php.net/manual/en/function.mb-detect-encoding.php#102510>.
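
As a minimal sketch of strict detection (not taken from the linked 3v4l snippets; the sample string is an assumption), the call below passes ISO-8859-1 bytes with 'UTF-8' as the only candidate. In strict mode the invalid byte rules UTF-8 out, so there is nothing left to return.

```php
<?php
// Hypothetical input: "café" encoded as ISO-8859-1. The trailing 0xE9 byte
// starts a multi-byte UTF-8 sequence that is never completed.
$latin1 = "caf\xE9";

// Strict mode (third argument true): the only candidate fails validation.
$strict = mb_detect_encoding($latin1, 'UTF-8', true);
var_dump($strict); // bool(false)

// Non-strict mode: the result can depend on the PHP version and on where the
// invalid byte sits in the string, so no expected output is shown here.
$loose = mb_detect_encoding($latin1, 'UTF-8', false);
var_dump($loose);
```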
 [2017-05-02 15:39 UTC] requinix@php.net
> it does not tell me the original string's encoding, but the encoding of the
> converted string if conversion was  possible. 
There's no way to know for sure what the original encoding of a string or file was. They simply don't store that kind of information. All PHP can do is look through a set of encodings and find the first one that supports the byte sequence. It also doesn't help that most encodings out there are identical in the printable ASCII range (bytes 0x20-0x7E).
You, as the developer, having more knowledge about the input string than PHP, have to construct that set of encodings carefully while taking into account the nature of the byte sequences they create and how they can overlap with each other.
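
One way to build such a set is sketched below, assuming the input can only be UTF-8 or ISO-8859-1 (an assumption made for this example, not something stated in the report). UTF-8 goes first because it rejects many byte sequences. ISO-8859-1 goes last because mbstring accepts every byte in it, so it works as a catch-all.

```php
<?php
// Assumed candidate list, most restrictive encoding first.
$candidates = ['UTF-8', 'ISO-8859-1'];

// Hypothetical sample inputs.
$latin1 = "caf\xE9";     // "café" in ISO-8859-1
$utf8   = "caf\xC3\xA9"; // "café" in UTF-8

// Strict detection against the ordered candidate list.
$fromLatin1 = mb_detect_encoding($latin1, $candidates, true);
$fromUtf8   = mb_detect_encoding($utf8, $candidates, true);

var_dump($fromLatin1); // string(10) "ISO-8859-1"
var_dump($fromUtf8);   // string(5) "UTF-8"
```

The order matters because the UTF-8 bytes are also a valid ISO-8859-1 sequence (they would read as "cafÃ©"). Putting the catch-all encoding first can therefore hide the more specific match.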
 