PHP :: Bug #36994 :: mb_detect_encoding returns wrong result with trailing accent

Bug #36994

mb_detect_encoding returns wrong result with trailing accent

Submitted:

2006-04-06 11:49 UTC

Modified:

2006-04-18 07:16 UTC

Votes:	1
Avg. Score:	4.0 ± 0.0
Reproduced:	1 of 1 (100.0%)
Same Version:	1 (100.0%)
Same OS:	1 (100.0%)

From:

ynynmzvqofeaz at mailinator dot com

Assigned:

hirokawa (profile)

Status:

Not a bug

Package:

mbstring related

PHP Version:

4.4.2

OS:

Linux

Private report:

CVE-ID:

None

[2006-04-06 11:49 UTC] ynynmzvqofeaz at mailinator dot com

Description:
------------
mb_detect_encoding returns the wrong result when the text ends with an accented character.
See http://www.php.net/manual/en/function.mb-detect-encoding.php#55228

[2006-04-10 07:02 UTC] ynynmzvqofeaz1 at mailinator dot com

*sigh*

<?php
echo mb_detect_encoding("test?");
?>

If "file INFILE" says utf-8, convert it first:
iconv -f utf-8 -t iso-8859-1 INFILE > OUTFILE

[2006-04-10 11:53 UTC] ynynmzvqofeaz at mailinator dot com

Ignore the last comment. Do this:

Create two files with the following content, and name them test_iso1 and test_utf8:
<?php
echo mb_detect_encoding("test?");
?>

Make sure the encoding is correct:
$ file test_iso1
should return iso-8859-1
$ file test_utf8
should return utf-8

If they do not return the correct encoding, use iconv to convert them, e.g.
$ iconv -f utf-8 -t iso-8859-1 test_iso1 >test_iso1.fixed
or
$ iconv -f iso-8859-1 -t utf-8 test_utf8 >test_utf8.fixed

Now run each script. The test_iso1 script should report ISO-8859-1, and the test_utf8 script should report UTF-8.

Workaround: append an extra character to the end of the string, and then remove it(!)
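
A minimal sketch of that workaround, assuming a PHP version with scalar type declarations. The helper name detect_with_sentinel and the candidate list are hypothetical, not part of mbstring. The sentinel is appended only to a temporary copy, so nothing has to be removed afterwards. The sample string ends in a lone 0xE9 byte (é in ISO-8859-1), chosen for illustration.

```php
<?php
// Hypothetical helper, not part of mbstring: run detection on a temporary
// copy of the string with an ASCII sentinel appended to it.
function detect_with_sentinel(string $text, array $candidates = ['UTF-8', 'ISO-8859-1'])
{
    // With 'a' after it, a trailing lead byte can no longer be read as the
    // start of an unfinished UTF-8 character. $text itself is not modified.
    return mb_detect_encoding($text . 'a', $candidates);
}

// "test" followed by a lone 0xE9 byte.
echo detect_with_sentinel("test\xE9"), "\n";
```

The result may still depend on the PHP version and on the candidate list, so treat this as an illustration rather than a guaranteed fix.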

[2006-04-17 15:41 UTC] hirokawa@php.net

This is not a bug; it is the specified behavior.
You should use 'strict' mode in mb_detect_encoding()
if you need it to return the correct result.

mb_detect_encoding() treats the string as a byte stream.
{0x61,0x63,0x63,0x65,0x6e,0x74,0x75,0xe9} is accepted as a valid
UTF-8 byte stream.
In this case, 0xe9 is treated as the first byte of a
multibyte character.

{0x61,0x63,0x63,0x65,0x6e,0x74,0x75,0xe9,0x65} is not a valid
UTF-8 byte stream, because 0xe9 0x65 is an invalid byte sequence in
UTF-8: 0x65 is not a continuation byte.
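
To make the byte-level reasoning concrete, here is a rough sketch that classifies a byte by its UTF-8 bit pattern. The function name utf8_byte_role is hypothetical. It looks only at the leading bits and does not reject overlong or out-of-range lead bytes, so it is a simplification, not a validator.

```php
<?php
// Hypothetical helper: classify one byte by its UTF-8 bit pattern only.
function utf8_byte_role(int $byte): string
{
    if ($byte <= 0x7F) {
        return 'ascii';         // 0xxxxxxx
    }
    if (($byte & 0xC0) === 0x80) {
        return 'continuation';  // 10xxxxxx
    }
    if (($byte & 0xE0) === 0xC0) {
        return 'lead-2';        // 110xxxxx, starts a 2-byte character
    }
    if (($byte & 0xF0) === 0xE0) {
        return 'lead-3';        // 1110xxxx, starts a 3-byte character
    }
    if (($byte & 0xF8) === 0xF0) {
        return 'lead-4';        // 11110xxx, starts a 4-byte character
    }
    return 'invalid';
}

echo utf8_byte_role(0xE9), "\n"; // the trailing byte from the first stream
echo utf8_byte_role(0x65), "\n"; // the byte that follows it in the second stream
```

0xe9 has the bit pattern of a lead byte, so a stream that ends there can still look like the start of a character. Once 0x65 follows it, the sequence is broken, because a lead byte must be followed by continuation bytes.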

If you need detection to reject an incomplete multibyte character, use the 'strict' option, for example:
echo mb_detect_encoding($s1, 'UTF-8, ISO-8859-1', true);
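
A hedged sketch of strict detection, using byte strings written with escapes so the result does not depend on how the script file itself is saved. The variable names are illustrative. mb_check_encoding() is shown as a complementary check and may not be available in very old PHP releases.

```php
<?php
// "accentu" followed by a lone 0xE9 (é in ISO-8859-1),
// and "accentu" followed by 0xC3 0xA9 (é in UTF-8).
$latin1 = "accentu\xE9";
$utf8   = "accentu\xC3\xA9";

// Strict mode: the unfinished trailing sequence disqualifies UTF-8,
// so the next candidate in the list is considered.
$strictLatin1 = mb_detect_encoding($latin1, ['UTF-8', 'ISO-8859-1'], true);

// mb_check_encoding() validates the bytes against one named encoding.
$latin1IsUtf8 = mb_check_encoding($latin1, 'UTF-8');
$utf8IsUtf8   = mb_check_encoding($utf8, 'UTF-8');

var_dump($strictLatin1, $latin1IsUtf8, $utf8IsUtf8);
```

Validating against a single expected encoding avoids guessing between candidates, which helps when the input is short and several encodings fit.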

[2006-04-18 07:16 UTC] ynynmzvqofeaz at mailinator dot com

So to use my example from before, why do both
 $string = "test?"
in a utf-8 text file, and 
 $string = "test?"
in an iso-8859-1 file (converted using iconv) return "UTF-8" with mb_detect_encoding, even when strict is on?
