Bug #36994 mb_detect_encoding returns wrong result with trailing accent
Submitted: 2006-04-06 11:49 UTC Modified: 2006-04-18 07:16 UTC
Avg. Score:4.0 ± 0.0
Reproduced:1 of 1 (100.0%)
Same Version:1 (100.0%)
Same OS:1 (100.0%)
From: ynynmzvqofeaz at mailinator dot com Assigned: hirokawa (profile)
Status: Not a bug Package: mbstring related
PHP Version: 4.4.2 OS: Linux
Private report: No CVE-ID: None

 [2006-04-06 11:49 UTC] ynynmzvqofeaz at mailinator dot com
mb_detect_encoding returns wrong result when text contains a trailing accent.


 [2006-04-10 07:02 UTC] ynynmzvqofeaz1 at mailinator dot com

echo mb_detect_encoding("test?");

If "file OUTFILE" still says utf-8, convert the file first:
iconv -f utf-8 -t iso-8859-1 INFILE > OUTFILE
 [2006-04-10 11:53 UTC] ynynmzvqofeaz at mailinator dot com
Ignore the last comment. Do this:

Create two files with the following content, and name them test_iso1 and test_utf8:
echo mb_detect_encoding("test?");

Make sure the encoding is correct:
$ file test_iso1
should return iso-8859-1
$ file test_utf8
should return utf-8

If they do not return the correct encoding, use iconv to convert them, e.g.
$ iconv -f utf-8 -t iso-8859-1 test_iso1 >test_iso1.fixed
$ iconv -f iso-8859-1 -t utf-8 test_utf8 >test_utf8.fixed

Now run each script. The test_iso1 script should return a type of iso1, the test_utf8 script should return a type of utf8.
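
To rule out the editor or the file encoding as a factor, the two test strings can also be written with explicit byte escapes and inspected with bin2hex(). This is only a sketch: it assumes, for illustration, that the trailing accented character is "é" (0xE9 in ISO-8859-1, 0xC3 0xA9 in UTF-8), and the variable names are made up.

```php
<?php
// Assumption: the trailing accented character is "é".
$iso1 = "test\xE9";      // ISO-8859-1: a single byte 0xE9
$utf8 = "test\xC3\xA9";  // UTF-8: the two bytes 0xC3 0xA9

// Dump the raw bytes so the encoding of each string is unambiguous.
echo bin2hex($iso1), "\n"; // 74657374e9
echo bin2hex($utf8), "\n"; // 74657374c3a9
```

If both scripts on disk dump the same hex string, they contain the same bytes, and mb_detect_encoding() cannot tell them apart.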

Workaround: append an extra character to the end of the string, and then remove it(!)
 [2006-04-17 15:41 UTC]
This is not a bug; it is the specified behavior.
Use 'strict' mode in mb_detect_encoding() if you need
a correct result.

mb_detect_encoding() treats the string as a byte stream.
{0x61,0x63,0x63,0x65,0x6e,0x74,0x75,0xe9} is accepted as a
UTF-8 byte stream.
In this case, 0xe9 is treated as the first byte of
a multibyte character that has not been completed yet.

{0x61,0x63,0x63,0x65,0x6e,0x74,0x75,0xe9,0x65} is an invalid
UTF-8 byte stream because 0xe9 0x65 is an invalid byte sequence in UTF-8.
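
The reasoning above follows from the bit patterns UTF-8 uses for lead and continuation bytes. A simplified sketch, assuming PHP 7 or later for the type declarations; utf8_byte_role() is a hypothetical helper, not part of mbstring, and it ignores the few lead-byte values (such as 0xC0, 0xC1, 0xF5 and above) that UTF-8 also forbids:

```php
<?php
// Hypothetical helper: classify one byte by the role it can play in UTF-8.
function utf8_byte_role(int $byte): string
{
    if ($byte <= 0x7F) {
        return 'ascii';        // 0xxxxxxx: complete single-byte character
    }
    if (($byte & 0xC0) === 0x80) {
        return 'continuation'; // 10xxxxxx: may only follow a lead byte
    }
    if (($byte & 0xE0) === 0xC0) {
        return 'lead-2';       // 110xxxxx: starts a 2-byte character
    }
    if (($byte & 0xF0) === 0xE0) {
        return 'lead-3';       // 1110xxxx: starts a 3-byte character
    }
    if (($byte & 0xF8) === 0xF0) {
        return 'lead-4';       // 11110xxx: starts a 4-byte character
    }
    return 'invalid';
}

echo utf8_byte_role(0xE9), "\n"; // lead-3
echo utf8_byte_role(0x65), "\n"; // ascii
```

0xe9 on its own looks like the start of a 3-byte character, so a trailing 0xe9 can be read as a truncated stream. Once 0x65 follows it, the byte after the lead byte is not a continuation byte, and the sequence can no longer be UTF-8.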

If you need strings that end in an incomplete multibyte character to be excluded from detection, use the 'strict' option, for example:
echo mb_detect_encoding($s1 , 'UTF-8, ISO-8859-1',true);
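
A related way to get an unambiguous answer is to validate the whole string against each candidate with mb_check_encoding() and take the first match. This is a sketch rather than what mb_detect_encoding() does internally; first_valid_encoding() is a hypothetical helper, and the example assumes PHP 7.1 or later with the mbstring extension loaded.

```php
<?php
// Hypothetical helper: return the first candidate encoding in which
// every byte sequence of $s is valid, or null if none matches.
function first_valid_encoding(string $s, array $candidates): ?string
{
    foreach ($candidates as $enc) {
        if (mb_check_encoding($s, $enc)) {
            return $enc;
        }
    }
    return null;
}

$latin1 = "accentu\xE9";     // the byte stream from the comment above
$utf8   = "accentu\xC3\xA9"; // the same text encoded as UTF-8

echo first_valid_encoding($latin1, ['UTF-8', 'ISO-8859-1']), "\n"; // ISO-8859-1
echo first_valid_encoding($utf8, ['UTF-8', 'ISO-8859-1']), "\n";   // UTF-8
```

The order of the candidates matters: any byte string passes the ISO-8859-1 check, so UTF-8 has to be tried first.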

 [2006-04-18 07:16 UTC] ynynmzvqofeaz at mailinator dot com
So to use my example from before, why do both
 $string = "test?"
in a utf-8 text file, and 
 $string = "test?"
in an iso-8859-1 file (converted using iconv) return "UTF-8" with mb_detect_encoding, even when strict is on?
Last updated: Sun May 26 18:01:33 2024 UTC