Bug #72933 mb_detect_encoding analyzing only the first byte of a string
Submitted: 2016-08-24 12:37 UTC Modified: 2016-08-31 07:46 UTC
From: paul dot crovella at gmail dot com Assigned:
Status: Not a bug Package: mbstring related
PHP Version: Irrelevant OS:
Private report: No CVE-ID: None

 [2016-08-24 12:37 UTC] paul dot crovella at gmail dot com
Description:
------------
In non-strict mode it appears only the first byte of a string is being checked by mb_detect_encoding in some circumstances.

For example, the byte 0xf8 is not allowed anywhere in UTF-8. When it is placed at the start of the string, mb_detect_encoding() properly returns false regardless of which mode is used. However, if any valid UTF-8 byte occurs at the beginning of the string, mb_detect_encoding() in non-strict mode will declare the whole string to be UTF-8.

This is also evident with a string like "\xe1\xe9\xf3\xfa", which is the ISO-8859-1 encoded version of "áéóú". The first byte, 0xe1, is allowed in UTF-8 as the first byte of a three-byte character. However, it must be followed by two continuation bytes in the range 0x80-0xbf, and 0xe9 is not one, so the string as a whole is invalid UTF-8.
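
To make the byte rules above concrete, here is a minimal sketch of a structural UTF-8 check. The function name first_invalid_utf8_offset() is hypothetical and is not part of mbstring; it only checks lead bytes and continuation bytes, and deliberately does not reject overlong forms or surrogates, so it is an illustration rather than a full validator.

```php
<?php
// Hypothetical helper (not part of mbstring): returns the offset of the first
// byte that breaks UTF-8's lead/continuation structure, or null if none does.
// Simplification: overlong encodings and surrogates are NOT rejected here.
function first_invalid_utf8_offset(string $s): ?int
{
    $len = strlen($s);
    for ($i = 0; $i < $len; ) {
        $b = ord($s[$i]);
        if ($b < 0x80) {
            $need = 0;                       // ASCII, no continuation bytes
        } elseif ($b >= 0xC2 && $b <= 0xDF) {
            $need = 1;                       // lead byte of a 2-byte character
        } elseif ($b >= 0xE0 && $b <= 0xEF) {
            $need = 2;                       // lead byte of a 3-byte character
        } elseif ($b >= 0xF0 && $b <= 0xF4) {
            $need = 3;                       // lead byte of a 4-byte character
        } else {
            return $i;                       // 0x80-0xC1 and 0xF5-0xFF never start a character
        }
        // Every continuation byte must look like 10xxxxxx (0x80-0xBF).
        for ($k = 1; $k <= $need; $k++) {
            if ($i + $k >= $len || (ord($s[$i + $k]) & 0xC0) !== 0x80) {
                return $i;
            }
        }
        $i += $need + 1;
    }
    return null;
}

var_dump(first_invalid_utf8_offset("\xf8foo"));          // int(0)
var_dump(first_invalid_utf8_offset("foo\xf8"));          // int(3)
var_dump(first_invalid_utf8_offset("\xe1\xe9\xf3\xfa")); // int(0)
var_dump(first_invalid_utf8_offset("\xc3\xa1"));         // NULL ("á" encoded as UTF-8)
```

In the second case the problem only shows up at offset 3, which is why a check that stops after the first byte cannot see it.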

The suspect code is at: 
https://github.com/php/php-src/blob/c72282a13b12b7e572469eba7a7ce593d900a8a2/ext/mbstring/libmbfl/mbfl/mbfilter.c#L746-L761

The problem exists in all current versions of PHP:
https://3v4l.org/b9b6q

Test script:
---------------
<?php
// This returns as expected.
$str = "\xf8foo";
var_dump(
    mb_detect_encoding($str, 'UTF-8'),      // bool(false)
    mb_detect_encoding($str, 'UTF-8', true) // bool(false)
);

// This does not.
$str = "foo\xf8";
var_dump(
    mb_detect_encoding($str, 'UTF-8'),      // string(5) "UTF-8"
    mb_detect_encoding($str, 'UTF-8', true) // bool(false)
);

// Nor does this.
$str = "\xe1\xe9\xf3\xfa";
var_dump(
    mb_detect_encoding($str, 'UTF-8'),      // string(5) "UTF-8"
    mb_detect_encoding($str, 'UTF-8', true) // bool(false)
);


Expected result:
----------------
bool(false)
bool(false)
bool(false)
bool(false)
bool(false)
bool(false)

Actual result:
--------------
bool(false)
bool(false)
string(5) "UTF-8"
bool(false)
string(5) "UTF-8"
bool(false)

 [2016-08-24 12:50 UTC] paul dot crovella at gmail dot com
Additional finding: it seems to be a problem only when a single encoding is given in the list to detect. Even simply repeating UTF-8 achieves the expected result, e.g. mb_detect_encoding($str, 'UTF-8, UTF-8')

https://3v4l.org/g4BOv
 [2016-08-31 05:24 UTC] yohgaki@php.net
FYI: when you specify a single encoding, the proper function for this task is "mb_check_encoding()" rather than "mb_detect_encoding()".

Nonetheless, mb_detect_encoding() should work correctly.
 [2016-08-31 05:31 UTC] yohgaki@php.net
-Status: Open +Status: Not a bug
 [2016-08-31 05:31 UTC] yohgaki@php.net
Oops. This behavior is expected when the strict option is disabled. Since there is only one encoding in the encoding list, it returns that encoding's name as soon as the characters look like the target encoding.

Use mb_check_encoding() to check encoding validity.
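
As a rough illustration of that recommendation, the sketch below runs mb_check_encoding() over the three strings from the test script plus one plain ASCII string. It assumes the mbstring extension is loaded; the $samples array and the output format are made up for this example.

```php
<?php
// Sample inputs: the three strings from the report plus plain ASCII.
$samples = ["\xf8foo", "foo\xf8", "\xe1\xe9\xf3\xfa", "foo"];

foreach ($samples as $s) {
    // mb_check_encoding() validates the whole string against one encoding.
    $ok = mb_check_encoding($s, 'UTF-8');
    printf("%s => %s\n", bin2hex($s), var_export($ok, true));
}
// Only the last sample ("foo") is reported as valid UTF-8.
```

Unlike the non-strict detection discussed here, the position of the bad byte does not matter for this check.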
 [2016-08-31 07:46 UTC] paul dot crovella at gmail dot com
The reason you've given for this not being a bug is just a restatement of the bug itself. Can you explain why exactly bailing out after a single byte would be expected and desired behavior?

It's neither documented nor intuitive, and it's not uncommon for a developer to provide a single-item argument like this when trying out a function to see how it behaves. There's no basis for anyone using it to expect completely different behavior when only a single encoding is given, and they shouldn't have to check the length of the supported encoding list before attempting to use this function.

The 5-year-old highest-rated comment on the docs page for mb_detect_encoding describes non-strict mode as "pretty worthless"[1] because of this problem, and it recently came up again on Stack Overflow[2], where someone asked for an explanation.

I don't know who finds this behavior to be expected, but it isn't any of the users.

Your recommendation to use mb_check_encoding instead only sidesteps the problem. Personally I don't recommend anyone use mb_detect_encoding at all (the idea that you can "detect" a text encoding by looking at it is fundamentally flawed), but as long as it's going to exist it may as well try to do something sane.

[1] http://php.net/manual/en/function.mb-detect-encoding.php#102510
[2] http://stackoverflow.com/q/39117203/3942918
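
The point above about "detecting" an encoding can be sketched with the same four bytes from the report. This is only an illustration, assuming mbstring is loaded; ISO-8859-5 is picked arbitrarily as a second single-byte encoding and is not mentioned elsewhere in this report.

```php
<?php
// The same four bytes, decoded under two different single-byte encodings.
$bytes = "\xe1\xe9\xf3\xfa";

$asLatin1   = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1'); // "áéóú", per the report
$asCyrillic = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-5'); // arbitrary second choice

// Both decodings succeed and both results are well-formed UTF-8 ...
var_dump(mb_check_encoding($asLatin1, 'UTF-8'));   // bool(true)
var_dump(mb_check_encoding($asCyrillic, 'UTF-8')); // bool(true)

// ... yet they are different text, and the bytes alone cannot say which was meant.
var_dump($asLatin1 === $asCyrillic);               // bool(false)
echo $asLatin1, "\n", $asCyrillic, "\n";
```

Every byte sequence is valid in a single-byte encoding like these, so validity alone cannot pick one; the most a check can do is rule out encodings such as UTF-8 that the bytes violate.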