Bug #51563 Incorrect result
Submitted: 2010-04-15 16:06 UTC
Modified: 2016-01-08 14:56 UTC
Votes:9
Avg. Score:4.2 ± 1.1
Reproduced:9 of 9 (100.0%)
Same Version:2 (22.2%)
Same OS:2 (22.2%)
From: zdenis at free dot fr
Assigned:
Status: Not a bug
Package: mbstring related
PHP Version: 5.3.2
OS: Windows
Private report: No
CVE-ID: None
 [2010-04-15 16:06 UTC] zdenis at free dot fr
Description:
------------
When using mb_detect_encoding, the detected charset depends on how many é characters (or any other characters above 127) the string contains. The result is inconsistent and sometimes wrong.

Test script:
---------------
// little example
php -r "echo mb_detect_encoding(\"é\", 'UTF-8,ISO-8859-1');"
php -r "echo mb_detect_encoding(\"éé\", 'UTF-8,ISO-8859-1');"


// real life example
php -r "echo mb_detect_encoding(\"Produit commandé\", 'UTF-8,ISO-8859-1');"
php -r "echo mb_detect_encoding(\"Société\", 'UTF-8,ISO-8859-1');"


Expected result:
----------------
ISO-8859-1
ISO-8859-1
ISO-8859-1
ISO-8859-1

Actual result:
--------------
UTF-8
ISO-8859-1
UTF-8
ISO-8859-1


 [2010-05-06 01:23 UTC] felipe@php.net
-Status: Open
+Status: Assigned
-Assigned To:
+Assigned To: moriyoshi
 [2012-12-09 06:03 UTC] jmichae3 at yahoo dot com
I am getting Russian spam through my email forms. Strangely, mb_detect_encoding() reports my form mail content string as ASCII, even though the characters are around the Unicode Ѐ range.
This prevents me from detecting foreign-language characters in my form mail.
Please fix.

My code is:

    // detect foreign languages
    $arr[0] = "ASCII";
    $arr[1] = "US-ASCII";
    if (false === mb_detect_encoding($comment, $arr, true)) {
        echo "<div style='color:red;'>ERRORB:" . mb_detect_encoding($comment) . "</div>";
        return true; // error
    }

Using the string ЋϊЁγϋГИБЫЫЏωАДрмдп, which I generated from charmap, that last mb_detect_encoding($comment) call returns ASCII.
 [2016-01-08 12:02 UTC] jodev4u at gmail dot com
I would like to add some details. The issue appears when the only non-ASCII character is at the end of the string. So I guess that, while looping over the filters for encoding detection, the last character of the given string is lost.

See this small example:

    $utf8String = urldecode('Test%C3%A9');
    $encoding = (mb_detect_encoding($utf8String,array('UTF-8','ISO-8859-1')));
    echo $encoding.PHP_EOL;

    $isoString = urldecode('Test%E9');
    $encoding = (mb_detect_encoding($isoString,array('UTF-8','ISO-8859-1')));
    echo $encoding.PHP_EOL;

Outputs:
--------
UTF-8
UTF-8

Whereas:

    $utf8String = urldecode('Test%C3%A9');
    $encoding = (mb_detect_encoding($utf8String .' ', array('UTF-8','ISO-8859-1')));
    //                                            ^- Just add a stuffing char to the end of the string
    echo $encoding.PHP_EOL;

    $isoString = urldecode('Test%E9');
    $encoding = (mb_detect_encoding($isoString .' ', array('UTF-8','ISO-8859-1')));
    //                                           ^- Just add a stuffing char to the end of the string
    echo $encoding.PHP_EOL;

Outputs (as expected):
----------------------
UTF-8
ISO-8859-1

--
I don't know if it will help, but I searched the PHP source for the cause of this behavior. In PHP_FUNCTION(mb_detect_encoding) in https://github.com/php/php-src/blob/9d3cedbd880b5d7bce2c18b799b6ca8b3e58d257/ext/mbstring/mbstring.c, I wonder whether the call zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "s|zb", &str, &str_len, &encoding_list, &strict) fails to initialize str_len correctly, because that value is then used to initialize the mbfl_string object:

    mbfl_string_init(&string);
    string.no_language = MBSTRG(language);
    string.val = (unsigned char *)str;
    string.len = str_len;

If not, I suppose the error is in the mbfl_identify_encoding_name function, but I don't have any leads.
 [2016-01-08 13:27 UTC] jodev4u at gmail dot com
For information, the bug still exists in PHP 7.0.2 (cli) (built: Jan  6 2016 13:04:42), tested on Windows on 8 January 2016.
 [2016-01-08 14:56 UTC] requinix@php.net
-Status: Assigned
+Status: Not a bug
-Assigned To: moriyoshi
+Assigned To:
 [2016-01-08 14:56 UTC] requinix@php.net
mb_detect_encoding is quite possibly the most misunderstood function in PHP.

# False positives if the string ends with "bad" bytes (@zdenis, @jodev4u)

Each character encoding "filter" has two particular values on it that matter here: a flag and a status. The filter sets the flag if it outright rejects the string, such as by detecting an invalid or unexpected byte. The status can be used within the filter for state tracking, but basically it is 0 if the filter is ready to begin processing the next character, or >0 if the filter is in the middle of dealing with a multibyte character.

Enter $strict, the third parameter to mb_detect_encoding().

If $strict=false (default) then the charset is valid if, after running its filter,
1. The filter's flag is not set
That's all. This means that a string like "é" (the single byte 0xE9 in ISO-8859-1) will validate as UTF-8 because flag=0 (not rejected), even though status>0 (the filter saw the 0xE9 as the beginning of a three-byte sequence).

If $strict=true then the charset is valid if
1. The filter's flag is not set, and
2. The filter's status is 0
Now "é" will not validate.

With @jodev4u's method of appending a space, the UTF-8 filter fails because of an unexpected byte. After processing "Test\xE9" it expects to see a byte in the range 0x80-0xBF, but it actually gets 0x20, so it sets flag=1 and the filter fails. The next charset is ISO-8859-1, which validates successfully.
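
To make the flag/status behaviour concrete, here is a toy model of such a filter. It is only a sketch: toyUtf8Filter() and toyAccepts() are hypothetical names, this is not the libmbfl code, and it ignores details such as overlong sequences and surrogates. It only tracks the two values described above.

```php
<?php
// Toy model of a UTF-8 detection filter (hypothetical, NOT the libmbfl code).
// 'flag'   => true once a byte has been rejected outright.
// 'status' => number of continuation bytes still expected (0 = between characters).
function toyUtf8Filter(string $bytes): array
{
    $flag = false;
    $status = 0;
    $length = strlen($bytes);

    for ($i = 0; $i < $length; $i++) {
        $byte = ord($bytes[$i]);

        if ($status > 0) {
            // In the middle of a multibyte character: only 0x80-0xBF may follow.
            if ($byte >= 0x80 && $byte <= 0xBF) {
                $status--;
            } else {
                $flag = true;
                break;
            }
        } elseif ($byte < 0x80) {
            // Plain ASCII byte, status stays 0.
        } elseif ($byte >= 0xC2 && $byte <= 0xDF) {
            $status = 1; // lead byte of a two-byte sequence
        } elseif ($byte >= 0xE0 && $byte <= 0xEF) {
            $status = 2; // lead byte of a three-byte sequence (0xE9 lands here)
        } elseif ($byte >= 0xF0 && $byte <= 0xF4) {
            $status = 3; // lead byte of a four-byte sequence
        } else {
            $flag = true; // byte that can never start a character
            break;
        }
    }

    return ['flag' => $flag, 'status' => $status];
}

// Non-strict: only the flag matters. Strict: the status must also be 0.
function toyAccepts(string $bytes, bool $strict): bool
{
    $result = toyUtf8Filter($bytes);
    return !$result['flag'] && (!$strict || $result['status'] === 0);
}

var_dump(toyAccepts("Test\xE9", false));  // bool(true):  flag not set, status ignored
var_dump(toyAccepts("Test\xE9", true));   // bool(false): status is still 2
var_dump(toyAccepts("Test\xE9 ", false)); // bool(false): 0x20 sets the flag
```

In this model the trailing 0xE9 leaves status=2 with the flag unset, which is exactly the case that only the strict check catches.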

Moral of the story: you probably want to use $strict=true if you're using this function. Doing so produces the expected output.
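
A minimal sketch of the strict call, using @jodev4u's two strings. The bytes are written as escapes so the result does not depend on the encoding of the script file or the terminal; the variable names are only illustrative.

```php
<?php
$candidates = ['UTF-8', 'ISO-8859-1'];

$latin1 = "Test\xE9";     // "Testé" encoded as ISO-8859-1
$utf8   = "Test\xC3\xA9"; // "Testé" encoded as UTF-8

// Third argument is $strict.
echo mb_detect_encoding($latin1, $candidates, true), PHP_EOL; // ISO-8859-1
echo mb_detect_encoding($utf8, $candidates, true), PHP_EOL;   // UTF-8
```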


As for @jmichae3, I can't reproduce. https://3v4l.org/YuLm6
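
If the goal is only to find out whether a form field contains anything other than ASCII, encoding detection may not be needed at all. A rough sketch, where containsNonAscii() is a hypothetical helper and not part of mbstring:

```php
<?php
// Hypothetical helper: true if any byte falls outside 0x00-0x7F.
// No /u modifier, so the pattern is matched byte by byte.
function containsNonAscii(string $text): bool
{
    return preg_match('/[^\x00-\x7F]/', $text) === 1;
}

var_dump(containsNonAscii("plain text")); // bool(false)
var_dump(containsNonAscii("\xD0\x80"));   // bool(true): Ѐ (U+0400) encoded as UTF-8
```

This looks only at the raw bytes PHP received, so it says nothing about which encoding those bytes are in.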
 
Last updated: Sat Dec 03 02:05:54 2022 UTC