Doc Bug #74528 mb_check_encoding Documentation Confusing
Submitted: 2017-05-02 10:08 UTC
Modified: 2017-05-02 15:39 UTC
From: james dot antrim at nm dot thm dot de
Assigned: cmb
Status: Not a bug
Package: mbstring related
PHP Version: 5.6.30
OS: Debian 3.16.39-1+deb8u2 (2017-03
Private report: No
CVE-ID: None
 [2017-05-02 10:08 UTC] james dot antrim at nm dot thm dot de
Description:
------------
---
From manual page: http://www.php.net/function.mb-detect-encoding
---

I saved a file that was originally ISO-8859-1 in several different encodings, among them UTF-8 and Japanese Shift JIS. When using this function with 'UTF-8' as the 'encoding list' parameter I always get 'UTF-8' as the return value.

I assume this is working as intended, because it solves both detection and conversion in one function. I find the documentation of the return value more than confusing because, despite the function being called 'detect', it does not tell me the original string's encoding, but the encoding of the converted string if conversion was possible. My suggestion would be:
"The character encoding or FALSE if the encoding cannot be converted to any of the given strings."


Test script:
---------------
Convert any ISO-8859-1 file to any other encoding, then run mb_detect_encoding($file, 'UTF-8', true) on it.

This same documentation problem is evident for mb_check_encoding($file, 'UTF-8') which will always return true under the same conditions.


History

 [2017-05-02 14:33 UTC] cmb@php.net
-Status: Open
+Status: Not a bug
-Package: Documentation problem
+Package: mbstring related
-Assigned To:
+Assigned To: cmb
 [2017-05-02 14:33 UTC] cmb@php.net
mb_detect_encoding() is not supposed to do any conversion, so the
documentation is correct in this regard. However, if you want to
check for 'UTF-8' (and maybe other encodings as well), you have
to use strict detection, see <https://3v4l.org/DFIAs> vs.
<https://3v4l.org/M23D4> and also
<http://php.net/manual/en/function.mb-detect-encoding.php#102510>.
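
As a rough illustration of the strict-detection point (a minimal sketch, not the code behind the links above), the snippet below feeds two hand-written byte strings to mb_detect_encoding() and mb_check_encoding(). The variable names and sample bytes are assumptions made for the example. The result of the non-strict call is deliberately not annotated, since it may differ between PHP versions.

```php
<?php
// "ä" stored as ISO-8859-1 is the single byte 0xE4, which is not valid UTF-8.
$latin1 = "\xE4";
// The same character stored as UTF-8 is the two bytes 0xC3 0xA4.
$utf8 = "\xC3\xA4";

// Strict detection: the candidate is returned only if the bytes are valid in it.
var_dump(mb_detect_encoding($utf8, 'UTF-8', true));   // string(5) "UTF-8"
var_dump(mb_detect_encoding($latin1, 'UTF-8', true)); // bool(false)

// mb_check_encoding() validates against exactly one encoding.
var_dump(mb_check_encoding($latin1, 'UTF-8'));        // bool(false)

// Non-strict detection of the same invalid input; the outcome may depend on
// the PHP version, so no expected value is shown here.
var_dump(mb_detect_encoding($latin1, 'UTF-8'));
```

Neither call converts anything: both only report whether the given bytes are acceptable in the candidate encoding.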
 [2017-05-02 15:39 UTC] requinix@php.net
> it does not tell me the original string's encoding, but the encoding of the
> converted string if conversion was possible.
There's no way to know for sure what the original encoding of a string or file was. They simply don't store that kind of information. All PHP can do is look through a set of encodings and find the first one in which the byte sequence is valid. It also doesn't help that most encodings out there are identical in the printable ASCII range (bytes 0x20-0x7E).
You, as the developer, having more knowledge about the input string than PHP, have to construct that set of encodings carefully while taking into account the nature of the byte sequences they create and how they can overlap with each other.
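
To make the overlap concrete, here is a small sketch under stated assumptions: first_valid_encoding() is a hypothetical helper written for this example (it is not part of mbstring), and the sample bytes are made up. It walks the candidate list in order and returns the first encoding in which mb_check_encoding() accepts the bytes.

```php
<?php
// Hypothetical helper, not part of mbstring: return the first candidate
// encoding in which $bytes is a valid byte sequence, or false if none match.
function first_valid_encoding($bytes, array $candidates)
{
    foreach ($candidates as $encoding) {
        if (mb_check_encoding($bytes, $encoding)) {
            return $encoding;
        }
    }
    return false;
}

$utf8 = "\xC3\xA4"; // "ä" stored as UTF-8

// Every byte value is valid ISO-8859-1, so that candidate accepts any input.
var_dump(first_valid_encoding($utf8, ['UTF-8', 'ISO-8859-1'])); // string(5) "UTF-8"
var_dump(first_valid_encoding($utf8, ['ISO-8859-1', 'UTF-8'])); // string(10) "ISO-8859-1"

// Pure ASCII text is valid in both candidates, so the list order decides.
var_dump(first_valid_encoding('abc', ['ASCII', 'UTF-8']));      // string(5) "ASCII"
```

Because a single-byte encoding such as ISO-8859-1 accepts every byte sequence, it is usually only useful as the last entry of the list, after stricter encodings such as UTF-8.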
 
Last updated: Wed Sep 28 07:05:52 2022 UTC