Bug #72933 mb_detect_encoding analyzing only the first byte of a string
Submitted: 2016-08-24 12:37 UTC
Modified: 2016-08-31 07:46 UTC
Votes: 1
Avg. Score: 4.0 ± 0.0
Reproduced: 0 of 0 (0.0%)
From: paul dot crovella at gmail dot com
Assigned:
Status: Not a bug
Package: mbstring related
PHP Version: Irrelevant
OS:
Private report: No
CVE-ID: None
 [2016-08-24 12:37 UTC] paul dot crovella at gmail dot com
Description:
------------
In non-strict mode it appears only the first byte of a string is being checked by mb_detect_encoding in some circumstances.

For example, the byte 0xf8 is not allowed anywhere in UTF-8. When it is placed at the start of the string, mb_detect_encoding() properly returns false regardless of which mode is used. However, if any valid UTF-8 byte occurs at the beginning of the string, mb_detect_encoding() in non-strict mode will declare the whole string to be UTF-8.

This is also evident with a string like "\xe1\xe9\xf3\xfa", which is the ISO-8859-1 encoded version of "áéóú". The first byte, 0xe1, is allowed in UTF-8 as the lead byte of a three-byte character, but it must then be followed by two continuation bytes in the range 0x80-0xbf. The next byte, 0xe9, is not a continuation byte, so the string as a whole is not valid UTF-8.
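
To make the byte-level reasoning above concrete, here is a minimal sketch of a structural UTF-8 check. The function name utf8_structure_ok() is hypothetical and is not part of mbstring. It only checks lead bytes and continuation counts and does not reject overlong forms or surrogates, so it is looser than a full validator.

```php
<?php
// Hypothetical helper (not an mbstring function): structural UTF-8 check.
// Assumption: we only verify lead bytes and the number of continuation
// bytes; overlong encodings and surrogates are NOT rejected here.
function utf8_structure_ok(string $s): bool
{
    $len = strlen($s);
    for ($i = 0; $i < $len; $i += $need + 1) {
        $b = ord($s[$i]);
        if ($b <= 0x7f) {
            $need = 0;                 // single-byte (ASCII)
        } elseif ($b >= 0xc2 && $b <= 0xdf) {
            $need = 1;                 // lead byte of a 2-byte sequence
        } elseif ($b >= 0xe0 && $b <= 0xef) {
            $need = 2;                 // lead byte of a 3-byte sequence
        } elseif ($b >= 0xf0 && $b <= 0xf4) {
            $need = 3;                 // lead byte of a 4-byte sequence
        } else {
            return false;              // 0x80-0xc1 and 0xf5-0xff cannot start a character
        }
        if ($i + $need >= $len) {
            return false;              // sequence is cut off by the end of the string
        }
        for ($k = 1; $k <= $need; $k++) {
            $c = ord($s[$i + $k]);
            if ($c < 0x80 || $c > 0xbf) {
                return false;          // expected a continuation byte
            }
        }
    }
    return true;
}

var_dump(utf8_structure_ok("\xf8foo"));            // 0xf8 is never a valid lead byte
var_dump(utf8_structure_ok("foo\xf8"));            // same byte, at the end
var_dump(utf8_structure_ok("\xe1\xe9\xf3\xfa"));   // 0xe1 needs two continuation bytes
var_dump(utf8_structure_ok("\xc3\xa1\xc3\xa9"));   // "áé" encoded as UTF-8
```

Under this check the first three strings are rejected at the position of the offending byte, not only when that byte happens to come first.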

The suspect code is at: 
https://github.com/php/php-src/blob/c72282a13b12b7e572469eba7a7ce593d900a8a2/ext/mbstring/libmbfl/mbfl/mbfilter.c#L746-L761

The problem exists in all current versions of PHP:
https://3v4l.org/b9b6q

Test script:
---------------
<?php
// This returns as expected.
$str = "\xf8foo";
var_dump(
    mb_detect_encoding($str, 'UTF-8'),      // bool(false)
    mb_detect_encoding($str, 'UTF-8', true) // bool(false)
);

// This does not.
$str = "foo\xf8";
var_dump(
    mb_detect_encoding($str, 'UTF-8'),      // string(5) "UTF-8"
    mb_detect_encoding($str, 'UTF-8', true) // bool(false)
);

// Nor does this.
$str = "\xe1\xe9\xf3\xfa";
var_dump(
    mb_detect_encoding($str, 'UTF-8'),      // string(5) "UTF-8"
    mb_detect_encoding($str, 'UTF-8', true) // bool(false)
);


Expected result:
----------------
bool(false)
bool(false)
bool(false)
bool(false)
bool(false)
bool(false)

Actual result:
--------------
bool(false)
bool(false)
string(5) "UTF-8"
bool(false)
string(5) "UTF-8"
bool(false)

 [2016-08-24 12:50 UTC] paul dot crovella at gmail dot com
Additional finding: the problem seems to occur only when a single encoding is given in the list to detect. Even simply repeating UTF-8 achieves the expected result, e.g. mb_detect_encoding($str, 'UTF-8, UTF-8')

https://3v4l.org/g4BOv
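
Rather than padding the encoding list, one option that avoids depending on its length is to always pass the strict flag, which the test script in the report shows returning false for the invalid strings. The sketch below assumes that behavior; the wrapper name detect_encoding_strict() is hypothetical and exists only for illustration.

```php
<?php
// Hypothetical wrapper: always request strict detection so the result does
// not depend on how many candidate encodings are supplied.
// Returns the detected encoding name, or false if no candidate matches.
function detect_encoding_strict(string $s, array $candidates)
{
    return mb_detect_encoding($s, $candidates, true);
}

var_dump(detect_encoding_strict("foo\xf8", ['UTF-8']));  // invalid trailing byte
var_dump(detect_encoding_strict("foo", ['UTF-8']));      // plain ASCII is valid UTF-8
```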
 [2016-08-31 05:24 UTC] yohgaki@php.net
FYI: when you specify a single encoding, the proper function for this task is "mb_check_encoding()" rather than "mb_detect_encoding()".

Nonetheless, mb_detect_encoding() should work correctly.
 [2016-08-31 05:31 UTC] yohgaki@php.net
-Status: Open
+Status: Not a bug
 [2016-08-31 05:31 UTC] yohgaki@php.net
Oops. This behavior is expected when the strict option is disabled. Since there is only one encoding in the encoding list, it returns that encoding's name as soon as the characters look like the target encoding.

Use mb_check_encoding() to check encoding validity.
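
As a rough illustration of that suggestion, the following sketch runs mb_check_encoding() over the strings from the test script. The $samples and $results variables are illustrative names, not anything from the report.

```php
<?php
// Validate each sample against UTF-8 with mb_check_encoding(), which checks
// the whole string instead of guessing among candidate encodings.
$samples = ["\xf8foo", "foo\xf8", "\xe1\xe9\xf3\xfa", "foo"];

$results = [];
foreach ($samples as $s) {
    // Key by hex dump so invalid bytes are readable in the output.
    $results[bin2hex($s)] = mb_check_encoding($s, 'UTF-8');
}

var_dump($results);
```

Only the last sample, plain ASCII "foo", should be reported as valid UTF-8.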
 [2016-08-31 07:46 UTC] paul dot crovella at gmail dot com
The reason you've given for this not being a bug is just a restatement of the bug itself. Can you explain why exactly bailing out after a single byte would be expected and desired behavior?

It's neither documented nor intuitive, and it's not uncommon for a developer to provide a single-item argument like this when trying out a function to see how it behaves. There's no basis for anyone using it to expect completely different behavior when only a single encoding is given, and they shouldn't have to check the length of the supported encoding list before attempting to use this function.

The 5-year-old highest-rated comment on the docs page for mb_detect_encoding describes non-strict mode as "pretty worthless"[1] because of this problem, and it recently came up again on StackOverflow[2] in a question asking for an explanation.

I don't know who finds this behavior to be expected, but it isn't any of the users.

Your recommendation to use mb_check_encoding instead only sidesteps the problem. Personally I don't recommend anyone use mb_detect_encoding at all (the idea that you can "detect" a text encoding by looking at it is fundamentally flawed), but as long as it's going to exist it may as well try to do something sane.

[1] http://php.net/manual/en/function.mb-detect-encoding.php#102510
[2] http://stackoverflow.com/q/39117203/3942918
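
The point about "detecting" an encoding can be sketched with the same four bytes from the report. They are well-formed in more than one single-byte encoding, so the bytes alone do not say which text was meant. ISO-8859-5 is chosen here only as an arbitrary second candidate, and the variable names are illustrative.

```php
<?php
// The same byte string decoded under two different single-byte encodings.
$bytes = "\xe1\xe9\xf3\xfa";

$asLatin1   = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1'); // "áéóú"
$asCyrillic = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-5'); // Cyrillic letters

// Both conversions succeed, yet they produce different text.
var_dump($asLatin1, $asCyrillic, $asLatin1 === $asCyrillic);
```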