Bug #36994 mb_detect_encoding returns wrong result with trailing accent
Submitted: 2006-04-06 11:49 UTC Modified: 2006-04-18 07:16 UTC
Votes:1
Avg. Score:4.0 ± 0.0
Reproduced:1 of 1 (100.0%)
Same Version:1 (100.0%)
Same OS:1 (100.0%)
From: ynynmzvqofeaz at mailinator dot com Assigned: hirokawa (profile)
Status: Not a bug Package: mbstring related
PHP Version: 4.4.2 OS: Linux
Private report: No CVE-ID: None
 [2006-04-06 11:49 UTC] ynynmzvqofeaz at mailinator dot com
Description:
------------
mb_detect_encoding returns wrong result when text contains a trailing accent.
See http://www.php.net/manual/en/function.mb-detect-encoding.php#55228


 [2006-04-10 07:02 UTC] ynynmzvqofeaz1 at mailinator dot com
*sigh*

<?php
echo mb_detect_encoding("test?");
?>

use iconv -f utf-8 -t iso-8859-1 INFILE > OUTFILE
if "file OUTFILE" says utf-8.
 [2006-04-10 11:53 UTC] ynynmzvqofeaz at mailinator dot com
Ignore the last comment. Do this:

Create two files with the following content, and name them test_iso1 and test_utf8:
<?php
echo mb_detect_encoding("test?");
?>

Make sure the encoding is correct:
$ file test_iso1
should return iso-8859-1
$ file test_utf8
should return utf-8

If they do not return the correct encoding, use iconv to convert them, e.g.
$ iconv -f utf-8 -t iso-8859-1 test_iso1 >test_iso1.fixed
or
$ iconv -f iso-8859-1 -t utf-8 test_utf8 >test_utf8.fixed

Now run each script. The test_iso1 script should report ISO-8859-1, and the test_utf8 script should report UTF-8.

Workaround: append an extra character to the end of the string, and then remove it(!)
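
A reproduction that does not depend on how the script file is saved might look like the sketch below. It assumes the trailing accented character is "é" (the report shows it only as "?") and writes the bytes as escape sequences. The variable names and the explicit candidate list are illustrative, not taken from the report. The last call applies the workaround above by detecting on the string with one extra ASCII character appended. Results may differ between PHP versions, so no output is shown.

```php
<?php
// Assumption: the accented character is "é"; the report shows it only as "?".
$iso1 = "test\xE9";      // "é" as a single ISO-8859-1 byte
$utf8 = "test\xC3\xA9";  // "é" as two UTF-8 bytes

foreach (['iso1' => $iso1, 'utf8' => $utf8] as $label => $bytes) {
    // Non-strict detection with an explicit candidate list.
    $plain = mb_detect_encoding($bytes, 'UTF-8, ISO-8859-1');
    // Workaround: detect with an extra trailing character, leaving $bytes unchanged.
    $padded = mb_detect_encoding($bytes . 'a', 'UTF-8, ISO-8859-1');
    printf(
        "%s: bytes=%s plain=%s padded=%s\n",
        $label,
        bin2hex($bytes),
        var_export($plain, true),
        var_export($padded, true)
    );
}
```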
 [2006-04-17 15:41 UTC] hirokawa@php.net
This is not a bug; it is the specified behavior.
You should use 'strict' mode in mb_detect_encoding()
if you need it to return the correct result.

mb_detect_encoding() treats the string as a byte stream.
{0x61,0x63,0x63,0x65,0x6e,0x74,0x75,0xe9} is treated as a correct
UTF-8 byte stream.
In this case, 0xe9 is taken as the first byte of a
multibyte character that is still incomplete.

{0x61,0x63,0x63,0x65,0x6e,0x74,0x75,0xe9,0x65} is a wrong
UTF-8 byte stream because 0xe9 0x65 is an invalid byte sequence in
UTF-8.

If you need to exclude an incomplete multibyte character from detection, please try the 'strict' option, like this:
echo mb_detect_encoding($s1, 'UTF-8, ISO-8859-1', true);
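
The two byte streams described above can be checked directly. The following is a minimal sketch, not part of the original reply: it builds both streams from their hex bytes, runs detection with and without the strict flag, and also calls mb_check_encoding() as a plain validity check. The variable names are illustrative. Non-strict results may vary between PHP versions, so no output is shown.

```php
<?php
// The two byte streams from the explanation above.
$incomplete = "\x61\x63\x63\x65\x6e\x74\x75\xe9";      // ends in a lone lead byte 0xe9
$invalid    = "\x61\x63\x63\x65\x6e\x74\x75\xe9\x65";  // 0xe9 followed by 0x65

$candidates = 'UTF-8, ISO-8859-1';

foreach ([$incomplete, $invalid] as $s) {
    $loose  = mb_detect_encoding($s, $candidates);        // non-strict detection
    $strict = mb_detect_encoding($s, $candidates, true);  // strict detection
    $valid  = mb_check_encoding($s, 'UTF-8');             // is the whole string valid UTF-8?
    printf(
        "%s loose=%s strict=%s valid_utf8=%s\n",
        bin2hex($s),
        var_export($loose, true),
        var_export($strict, true),
        var_export($valid, true)
    );
}
```

mb_check_encoding() is included because it answers a narrower question than detection: whether the bytes are valid in one named encoding, with no guessing among candidates.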

 [2006-04-18 07:16 UTC] ynynmzvqofeaz at mailinator dot com
So to use my example from before, why do both
 $string = "test?"
in a utf-8 text file, and 
 $string = "test?"
in an iso-8859-1 file (converted using iconv) return "UTF-8" with mb_detect_encoding, even when strict is on?
 
Last updated: Sat Dec 21 13:01:31 2024 UTC