PHP :: Bug #72933 :: mb_detect_encoding analyzing only the first byte of a string

Bug #72933

mb_detect_encoding analyzing only the first byte of a string

Submitted:	2016-08-24 12:37 UTC
Modified:	2016-08-31 07:46 UTC
Votes:	1
Avg. Score:	4.0 ± 0.0
Reproduced:	0 of 0 (0.0%)
From:	paul dot crovella at gmail dot com
Assigned:	
Status:	Not a bug
Package:	mbstring related
PHP Version:	Irrelevant
OS:	
Private report:	
CVE-ID:	None


[2016-08-24 12:37 UTC] paul dot crovella at gmail dot com

Description:
------------
In non-strict mode it appears only the first byte of a string is being checked by mb_detect_encoding in some circumstances.

For example, the byte 0xf8 is not allowed anywhere in UTF-8. When placed at the start of the string, mb_detect_encoding() properly returns false for it regardless of which mode is used. However, if any valid UTF-8 byte occurs at the beginning of the string, mb_detect_encoding() in non-strict mode will declare the whole string to be UTF-8.

This is also evident with a string like "\xe1\xe9\xf3\xfa", which is the ISO-8859-1 encoded version of "áéóú". The first byte, 0xe1, is allowed in UTF-8 as the first byte of a multi-byte character. However, the string as a whole is not valid UTF-8, because a lead byte must be followed by continuation bytes in the range 0x80-0xBF.
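
To make the byte-level argument concrete, here is a minimal sketch that labels each byte of the two problem strings by the role it could play in UTF-8. The helper name utf8ByteRole() is hypothetical and is not part of mbstring; it only classifies single bytes and does not validate whole sequences.

```php
<?php
// Hypothetical helper (not part of PHP or mbstring): classify one byte
// by the role it can play in a UTF-8 stream.
function utf8ByteRole(int $byte): string
{
    if ($byte <= 0x7F) return 'ascii';
    if ($byte <= 0xBF) return 'continuation';
    if ($byte <= 0xC1) return 'invalid';   // 0xC0, 0xC1 would only form overlong sequences
    if ($byte <= 0xDF) return 'lead-2';    // starts a 2-byte sequence
    if ($byte <= 0xEF) return 'lead-3';    // starts a 3-byte sequence
    if ($byte <= 0xF4) return 'lead-4';    // starts a 4-byte sequence
    return 'invalid';                      // 0xF5..0xFF never appear in UTF-8
}

foreach (["foo\xf8", "\xe1\xe9\xf3\xfa"] as $str) {
    // unpack('C*') yields the string's bytes as unsigned integers.
    $roles = array_map('utf8ByteRole', array_values(unpack('C*', $str)));
    echo bin2hex($str), ': ', implode(' ', $roles), PHP_EOL;
}
```

In the second string every lead byte is followed by another lead byte (or an invalid byte) instead of a continuation byte, which is why the string cannot be UTF-8 even though its first byte looks plausible in isolation.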

The suspect code is at: 
https://github.com/php/php-src/blob/c72282a13b12b7e572469eba7a7ce593d900a8a2/ext/mbstring/libmbfl/mbfl/mbfilter.c#L746-L761

The problem exists in all current versions of PHP:
https://3v4l.org/b9b6q

Test script:
---------------
<?php
// This returns as expected.
$str = "\xf8foo";
var_dump(
    mb_detect_encoding($str, 'UTF-8'),      // bool(false)
    mb_detect_encoding($str, 'UTF-8', true) // bool(false)
);

// This does not.
$str = "foo\xf8";
var_dump(
    mb_detect_encoding($str, 'UTF-8'),      // string(5) "UTF-8"
    mb_detect_encoding($str, 'UTF-8', true) // bool(false)
);

// Nor does this.
$str = "\xe1\xe9\xf3\xfa";
var_dump(
    mb_detect_encoding($str, 'UTF-8'),      // string(5) "UTF-8"
    mb_detect_encoding($str, 'UTF-8', true) // bool(false)
);


Expected result:
----------------
bool(false)
bool(false)
bool(false)
bool(false)
bool(false)
bool(false)

Actual result:
--------------
bool(false)
bool(false)
string(5) "UTF-8"
bool(false)
string(5) "UTF-8"
bool(false)

History


[2016-08-24 12:50 UTC] paul dot crovella at gmail dot com

Additional finding: it seems to be a problem when only a single encoding is given on the list to detect. Even simply repeating UTF-8 achieves the expected result, e.g. mb_detect_encoding($str, 'UTF-8, UTF-8')
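
A small comparison script for this finding, as a sketch: it runs both problem strings against a single-entry list and a repeated list, in both modes. The variable names are illustrative only, and no expected output is annotated because non-strict results may differ between PHP versions.

```php
<?php
// Requires the mbstring extension.
// Compares a single-entry encoding list with the repeated-entry list
// mentioned above, in non-strict and strict mode.
$inputs = ["foo\xf8", "\xe1\xe9\xf3\xfa"];
$lists  = ['UTF-8', 'UTF-8, UTF-8'];

foreach ($inputs as $str) {
    foreach ($lists as $list) {
        $nonStrict = mb_detect_encoding($str, $list);
        $strict    = mb_detect_encoding($str, $list, true);
        printf(
            "%s  list=%-14s  non-strict=%s  strict=%s\n",
            bin2hex($str),
            "'$list'",
            var_export($nonStrict, true),
            var_export($strict, true)
        );
    }
}
```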

https://3v4l.org/g4BOv

[2016-08-31 05:24 UTC] yohgaki@php.net

FYI: when you specify a single encoding, the proper function for this task is "mb_check_encoding()" rather than "mb_detect_encoding()".

Nonetheless, mb_detect_encoding() should work correctly.

[2016-08-31 05:31 UTC] yohgaki@php.net

-Status: Open
+Status: Not a bug

[2016-08-31 05:31 UTC] yohgaki@php.net

Oops. This behavior is expected when the strict option is disabled. Since there is only one encoding in the encoding list, it returns the expected encoding name as soon as the chars look like the target encoding.

Use mb_check_encoding() to check encoding validity.
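
For reference, a minimal sketch of the suggested mb_check_encoding() approach applied to the strings from the report, plus two valid controls. The labels and the $results array are illustrative only.

```php
<?php
// Requires the mbstring extension.
// mb_check_encoding() validates the whole string against one encoding.
$samples = [
    'ascii only'        => "foo",
    'stray 0xF8 at end' => "foo\xf8",
    'ISO-8859-1 bytes'  => "\xe1\xe9\xf3\xfa",                 // "áéóú" in ISO-8859-1
    'UTF-8 bytes'       => "\xc3\xa1\xc3\xa9\xc3\xb3\xc3\xba", // "áéóú" in UTF-8
];

$results = [];
foreach ($samples as $label => $bytes) {
    $results[$label] = mb_check_encoding($bytes, 'UTF-8');
    printf("%-18s %s\n", $label, $results[$label] ? 'valid UTF-8' : 'not valid UTF-8');
}
```

Unlike non-strict detection with a single-entry list, this checks every byte, so both problem strings from the report are rejected.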

[2016-08-31 07:46 UTC] paul dot crovella at gmail dot com

The reason you've given for this not being a bug is just a restatement of the bug itself. Can you explain why exactly bailing out after a single byte would be expected and desired behavior?

It's neither documented nor intuitive, and it's not uncommon for a developer to provide a single-item argument like this when trying out a function to see how it behaves. There's no basis for anyone using it to expect completely different behavior when only a single encoding is given, and they shouldn't have to check the length of the supported encoding list before attempting to use this function.

The 5-year-old highest-rated comment on the docs page for mb_detect_encoding describes non-strict mode as "pretty worthless"[1] because of this problem, and it recently came up again on StackOverflow[2], where someone asked for an explanation.

I don't know who finds this behavior to be expected, but it isn't any of the users.

Your recommendation to use mb_check_encoding instead only sidesteps the problem. Personally I don't recommend anyone use mb_detect_encoding at all (the idea that you can "detect" a text encoding by looking at it is fundamentally flawed), but as long as it's going to exist it may as well try to do something sane.

[1] http://php.net/manual/en/function.mb-detect-encoding.php#102510
[2] http://stackoverflow.com/q/39117203/3942918

[2017-07-22 20:07 UTC] nikic@php.net

Related To: Bug #67386


