PHP :: Bug #72933 :: mb_detect_encoding analyzing only the first byte of a string

Bug #72933

mb_detect_encoding analyzing only the first byte of a string

Submitted: 2016-08-24 12:37 UTC
Modified: 2016-08-31 07:46 UTC
Votes: 1
Avg. Score: 4.0 ± 0.0
Reproduced: 0 of 0 (0.0%)
From: paul dot crovella at gmail dot com
Assigned:
Status: Not a bug
Package: mbstring related
PHP Version: Irrelevant
OS:
Private report:
CVE-ID: None

[2016-08-24 12:37 UTC] paul dot crovella at gmail dot com

Description:
------------
In non-strict mode it appears only the first byte of a string is being checked by mb_detect_encoding in some circumstances.

For example, the byte 0xf8 is not allowed anywhere in UTF-8. When it is placed at the start of the string, mb_detect_encoding() properly returns false regardless of which mode is used. However, if any valid UTF-8 byte occurs at the beginning of the string, mb_detect_encoding() in non-strict mode will declare the string to be UTF-8.

This is also evident with a string like "\xe1\xe9\xf3\xfa", which is the ISO-8859-1 encoded version of "áéóú". The first byte, 0xe1, is allowed in UTF-8 as the first byte of a multi-byte character; however, the sequence as a whole is invalid and may not occur in UTF-8.
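
To make the byte-level argument concrete, here is a minimal sketch that labels each byte by the role it can play in well-formed UTF-8. The helper name utf8_byte_role() is hypothetical (it is not part of mbstring), and the ranges follow the UTF-8 definition rather than anything in the libmbfl source.

```php
<?php
// Hypothetical helper (not an mbstring function): classify one byte by the
// role it may play in well-formed UTF-8.
function utf8_byte_role(int $byte): string
{
    if ($byte <= 0x7F) return 'ascii';          // 0x00-0x7F
    if ($byte <= 0xBF) return 'continuation';   // 0x80-0xBF
    if ($byte <= 0xC1) return 'invalid';        // 0xC0, 0xC1 (overlong forms)
    if ($byte <= 0xDF) return 'lead-2';         // starts a 2-byte sequence
    if ($byte <= 0xEF) return 'lead-3';         // starts a 3-byte sequence
    if ($byte <= 0xF4) return 'lead-4';         // starts a 4-byte sequence
    return 'invalid';                           // 0xF5-0xFF, includes 0xF8
}

foreach (["foo\xf8", "\xe1\xe9\xf3\xfa"] as $str) {
    // unpack('C*', ...) yields the bytes as unsigned integers.
    $roles = array_map('utf8_byte_role', array_values(unpack('C*', $str)));
    echo bin2hex($str), ': ', implode(' ', $roles), "\n";
}
```

In "\xe1\xe9\xf3\xfa" the lead byte 0xe1 would need two continuation bytes after it, but 0xe9 is itself a lead byte, so the sequence breaks at the second byte.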

The suspect code is at: 
https://github.com/php/php-src/blob/c72282a13b12b7e572469eba7a7ce593d900a8a2/ext/mbstring/libmbfl/mbfl/mbfilter.c#L746-L761

The problem exists in all current versions of PHP:
https://3v4l.org/b9b6q

Test script:
---------------
<?php
// This returns as expected.
$str = "\xf8foo";
var_dump(
    mb_detect_encoding($str, 'UTF-8'),      // bool(false)
    mb_detect_encoding($str, 'UTF-8', true) // bool(false)
);

// This does not.
$str = "foo\xf8";
var_dump(
    mb_detect_encoding($str, 'UTF-8'),      // string(5) "UTF-8"
    mb_detect_encoding($str, 'UTF-8', true) // bool(false)
);

// Nor does this.
$str = "\xe1\xe9\xf3\xfa";
var_dump(
    mb_detect_encoding($str, 'UTF-8'),      // string(5) "UTF-8"
    mb_detect_encoding($str, 'UTF-8', true) // bool(false)
);


Expected result:
----------------
bool(false)
bool(false)
bool(false)
bool(false)
bool(false)
bool(false)

Actual result:
--------------
bool(false)
bool(false)
string(5) "UTF-8"
bool(false)
string(5) "UTF-8"
bool(false)


[2016-08-24 12:50 UTC] paul dot crovella at gmail dot com

Additional finding: it seems to be a problem when only a single encoding is given in the list of encodings to detect. Even simply repeating UTF-8 achieves the expected result, e.g. mb_detect_encoding($str, 'UTF-8, UTF-8')

https://3v4l.org/g4BOv

[2016-08-31 05:24 UTC] yohgaki@php.net

FYI: when you specify a single encoding, the proper function for this task is "mb_check_encoding()" rather than "mb_detect_encoding()".

Nonetheless, mb_detect_encoding() should work correctly.

[2016-08-31 05:31 UTC] yohgaki@php.net

-Status: Open
+Status: Not a bug

[2016-08-31 05:31 UTC] yohgaki@php.net

Oops. This behavior is expected when the strict option is disabled. Since there is only one encoding in the encoding list, it returns the expected encoding name as soon as the characters look like the target encoding.

Use mb_check_encoding() to check encoding validity.
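
A minimal sketch of that suggestion, assuming the mbstring extension is loaded. It runs the three strings from the test script, plus a plain ASCII control string added here for comparison, through mb_check_encoding(); the $samples and $results names are only illustrative.

```php
<?php
// The three strings from the test script, plus a plain ASCII control.
$samples = [
    "\xf8foo",
    "foo\xf8",
    "\xe1\xe9\xf3\xfa",
    "foo",
];

$results = [];
foreach ($samples as $str) {
    // mb_check_encoding() validates the whole string against one encoding.
    $results[bin2hex($str)] = mb_check_encoding($str, 'UTF-8');
}

var_dump($results);
```

Only the ASCII control should come back as true; the position of the invalid byte makes no difference here.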

[2016-08-31 07:46 UTC] paul dot crovella at gmail dot com

The reason you've given for this not being a bug is just a restatement of the bug itself. Can you explain why exactly bailing out after a single byte would be expected and desired behavior?

It's neither documented nor intuitive, and it's not uncommon for a developer to provide a single-item argument like this when trying out a function to see how it behaves. There's no basis for anyone using it to expect completely different behavior when only a single encoding is given, and they shouldn't have to check the length of the supported encoding list before attempting to use this function.

The 5-year-old highest-rated comment on the docs page for mb_detect_encoding describes non-strict mode as "pretty worthless"[1] because of this problem, and it recently came up again on StackOverflow[2], where someone asked for an explanation.

I don't know who finds this behavior to be expected, but it isn't any of the users.

Your recommendation to use mb_check_encoding instead only sidesteps the problem. Personally I don't recommend anyone use mb_detect_encoding at all (the idea that you can "detect" a text encoding by looking at it is fundamentally flawed), but as long as it's going to exist it may as well try to do something sane.
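
One way to see why detection by inspection is ambiguous: the same bytes can be well-formed in more than one encoding. The sketch below is an illustration under the assumption that mbstring is loaded; the byte string and variable names are chosen here for the example and are not taken from the report.

```php
<?php
// Two bytes: "é" when read as UTF-8, "Ã©" when read as ISO-8859-1.
$bytes = "\xc3\xa9";

// Both checks pass, so validity alone cannot tell the two readings apart.
$validUtf8   = mb_check_encoding($bytes, 'UTF-8');
$validLatin1 = mb_check_encoding($bytes, 'ISO-8859-1');

// Re-encode to UTF-8 under the ISO-8859-1 assumption to show the other reading.
$asLatin1 = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');

var_dump($validUtf8, $validLatin1, bin2hex($bytes), bin2hex($asLatin1));
```

Every byte value is assigned in ISO-8859-1, so a validity check against it cannot rule anything out; choosing between candidates always relies on a heuristic or on outside knowledge of the source.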

[1] http://php.net/manual/en/function.mb-detect-encoding.php#102510
[2] http://stackoverflow.com/q/39117203/3942918

[2017-07-22 20:07 UTC] nikic@php.net

Related To: Bug #67386
