PHP :: Bug #51563 :: Incorrect result

Submitted:	2010-04-15 16:06 UTC
Modified:	2016-01-08 14:56 UTC

Votes:	9
Avg. Score:	4.2 ± 1.1
Reproduced:	9 of 9 (100.0%)
Same Version:	2 (22.2%)
Same OS:	2 (22.2%)

From:	zdenis at free dot fr
Assigned:
Status:	Not a bug
Package:	mbstring related
PHP Version:	5.3.2
OS:	Windows
Private report:
CVE-ID:	None

[2010-04-15 16:06 UTC] zdenis at free dot fr

Description:
------------
When using mb_detect_encoding, the detected charset depends on how many é characters (or any other characters above 127) are present in the string. The result is inconsistent and sometimes wrong.

Test script:
---------------
// little example
php -r "echo mb_detect_encoding(\"é\", 'UTF-8,ISO-8859-1');"
php -r "echo mb_detect_encoding(\"éé\", 'UTF-8,ISO-8859-1');"


// real life example
php -r "echo mb_detect_encoding(\"Produit commandé\", 'UTF-8,ISO-8859-1');"
php -r "echo mb_detect_encoding(\"Société\", 'UTF-8,ISO-8859-1');"


Expected result:
----------------
ISO-8859-1
ISO-8859-1
ISO-8859-1
ISO-8859-1

Actual result:
--------------
UTF-8
ISO-8859-1
UTF-8
ISO-8859-1

[2010-05-06 01:23 UTC] felipe@php.net

-Status: Open
+Status: Assigned
-Assigned To:
+Assigned To: moriyoshi

[2012-12-09 06:03 UTC] jmichae3 at yahoo dot com

I am getting Russian spam in my email forms. Strangely enough, mb_detect_encoding() reports my form mail content string as ASCII! The characters are around Unicode code point 1024 (U+0400), which is the Cyrillic range.
This prevents me from detecting foreign-language characters in my form mail.
Please fix.

My code is:

    //detect foreign languages
    $arr[0] = "ASCII";
    $arr[1] = "US-ASCII";
    if (false===mb_detect_encoding($comment,$arr,true)) {
        echo "<div style='color:red;'>ERRORB:".mb_detect_encoding($comment)."</div>";
        return true; //error
    }

Using the string ЋϊЁγϋГИБЫЫЏωАДрмдп, which I generated from charmap, I get
ASCII as the result of that last mb_detect_encoding($comment) call.

[2016-01-08 12:02 UTC] jodev4u at gmail dot com

I would like to add some details. The issue appears when the only non-ASCII character is at the end of the string. My guess is that, while looping over the filters for encoding detection, the last character of the given string is lost.

This small example illustrates it:

    $utf8String = urldecode('Test%C3%A9');
    $encoding = (mb_detect_encoding($utf8String,array('UTF-8','ISO-8859-1')));
    echo $encoding.PHP_EOL;

    $isoString = urldecode('Test%E9');
    $encoding = (mb_detect_encoding($isoString,array('UTF-8','ISO-8859-1')));
    echo $encoding.PHP_EOL;

This prints:
------------
UTF-8
UTF-8

Whereas:

    $utf8String = urldecode('Test%C3%A9');
    $encoding = (mb_detect_encoding($utf8String .' ', array('UTF-8','ISO-8859-1')));
    //                                            ^- Append a padding character to the end of the string
    echo $encoding.PHP_EOL;

    $isoString = urldecode('Test%E9');
    $encoding = (mb_detect_encoding($isoString .' ', array('UTF-8','ISO-8859-1')));
    //                                           ^- Append a padding character to the end of the string
    echo $encoding.PHP_EOL;

This prints (as expected):
--------------------------
UTF-8
ISO-8859-1

--
I don't know if it will help, but I looked for the cause of this behavior in the PHP source at https://github.com/php/php-src/blob/9d3cedbd880b5d7bce2c18b799b6ca8b3e58d257/ext/mbstring/mbstring.c. I wonder whether, in PHP_FUNCTION(mb_detect_encoding), the call zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "s|zb", &str, &str_len, &encoding_list, &strict) fails to initialize str_len correctly, because this value is then used to initialize the mbfl_string object:

    mbfl_string_init(&string);
    string.no_language = MBSTRG(language);
    string.val = (unsigned char *)str;
    string.len = str_len;

If not, I suppose the error is in the mbfl_identify_encoding_name function, but I have no leads.

[2016-01-08 13:27 UTC] jodev4u at gmail dot com

For the record, the bug still exists in PHP 7.0.2 (cli) (built: Jan  6 2016 13:04:42), tested on 8 January 2016 on Windows.

[2016-01-08 14:56 UTC] requinix@php.net

-Status: Assigned
+Status: Not a bug
-Assigned To: moriyoshi
+Assigned To:

[2016-01-08 14:56 UTC] requinix@php.net

mb_detect_encoding is quite possibly the most misunderstood function in PHP.

# False positives if the string ends with "bad" bytes (@zdenis, @jodev4u)

Each character encoding "filter" has two particular values on it that matter here: a flag and a status. The filter will set the flag if it outright rejects the string, such as by detecting an invalid or unexpected byte. The status can be used within the filter for state tracking, but basically it is 0 if the filter is ready to begin processing the next character and >0 if the filter is in the middle of dealing with a multibyte character.

Enter $strict, the third parameter to mb_detect_encoding().

If $strict=false (default) then the charset is valid if, after running its filter,
1. The filter's flag is not set
That's all. This means that a string like "é" (the single ISO-8859-1 byte 0xE9) will validate as UTF-8 because flag=0 (not rejected), even though status>0 (the filter saw the 0xE9 as the beginning of a three-byte sequence).

If $strict=true then the charset is valid if
1. The filter's flag is not set, and
2. The filter's status is 0
Now "é" will not validate.

With @jodev4u's method of appending a space, the UTF-8 filter fails because of an unexpected byte. After processing "Test\xE9" it expects to see a byte 0x80-0xBF, but it actually gets 0x20, so it sets flag=1 and the filter fails. The next charset is ISO-8859-1, which validates successfully.
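
The flag and status rules above can be modelled in a few lines of plain PHP. This is only a sketch for illustration: runUtf8Filter() and acceptsAsUtf8() are hypothetical names, the byte ranges are simplified, and none of this is the actual libmbfl code.

```php
<?php
// Simplified model of a UTF-8 detection filter (hypothetical, not libmbfl).
// flag   = 1 once the filter has rejected the input
// status = number of continuation bytes still expected (0 = between characters)
function runUtf8Filter(string $bytes): array
{
    $flag = 0;
    $status = 0;

    for ($i = 0, $n = strlen($bytes); $i < $n; $i++) {
        $b = ord($bytes[$i]);

        if ($status > 0) {
            if ($b >= 0x80 && $b <= 0xBF) {
                $status--;       // valid continuation byte
                continue;
            }
            $flag = 1;           // unexpected byte inside a multibyte sequence
            break;
        }

        if ($b <= 0x7F) {
            continue;            // single-byte (ASCII) character
        }
        if ($b >= 0xC2 && $b <= 0xDF) {
            $status = 1;         // lead byte of a two-byte sequence
        } elseif ($b >= 0xE0 && $b <= 0xEF) {
            $status = 2;         // lead byte of a three-byte sequence
        } elseif ($b >= 0xF0 && $b <= 0xF4) {
            $status = 3;         // lead byte of a four-byte sequence
        } else {
            $flag = 1;           // this byte can never start a character
            break;
        }
    }

    return ['flag' => $flag, 'status' => $status];
}

// Non-strict: only the flag matters. Strict: the status must also be 0.
function acceptsAsUtf8(string $bytes, bool $strict = false): bool
{
    $result = runUtf8Filter($bytes);
    if ($result['flag'] !== 0) {
        return false;
    }
    return !$strict || $result['status'] === 0;
}

var_dump(acceptsAsUtf8("Test\xE9"));        // bool(true): dangling lead byte, flag still 0
var_dump(acceptsAsUtf8("Test\xE9", true));  // bool(false): status is 2 at end of input
var_dump(acceptsAsUtf8("Test\xE9 "));       // bool(false): 0x20 where 0x80-0xBF was expected
var_dump(acceptsAsUtf8("Test\xC3\xA9"));    // bool(true): complete two-byte sequence
```

In this model "\xE9\xE9" is also rejected, because the second 0xE9 is not a continuation byte. That is consistent with the original report, where "éé" and "Société" were detected as ISO-8859-1 while a single trailing "é" was not.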

Moral of the story: you probably want to use $strict=true if you're using this function. Doing so produces the expected output.
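
A minimal sketch of the strict call, using the byte strings from @jodev4u's example. detectEncodingStrict() is a hypothetical wrapper, and the candidate list is the one used throughout this report.

```php
<?php
// Hypothetical wrapper: strict detection over a fixed candidate list.
// Returns the encoding name, or false if no candidate matches.
function detectEncodingStrict(string $bytes)
{
    // Third argument true = strict mode, so a string that ends in the
    // middle of a multibyte sequence is no longer accepted as UTF-8.
    return mb_detect_encoding($bytes, ['UTF-8', 'ISO-8859-1'], true);
}

echo detectEncodingStrict("Test\xC3\xA9"), PHP_EOL; // complete UTF-8 sequence
echo detectEncodingStrict("Test\xE9"), PHP_EOL;     // ISO-8859-1
```

Per the thread, the expected output for this pair is UTF-8 followed by ISO-8859-1, without the trailing-space workaround.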

As for @jmichae3, I can't reproduce. https://3v4l.org/YuLm6
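
For the use case @jmichae3 describes (rejecting form input that contains non-ASCII characters), validating against a single encoding avoids detection altogether. A sketch, assuming the form input arrives as UTF-8 bytes; containsNonAscii() is a hypothetical helper name.

```php
<?php
// Hypothetical helper: true if $text contains any byte outside 7-bit ASCII.
function containsNonAscii(string $text): bool
{
    // mb_check_encoding() validates the whole string against one encoding,
    // so there is no candidate list and no detection order to worry about.
    return !mb_check_encoding($text, 'ASCII');
}

// Assumption: the input is UTF-8. These bytes encode U+040B and U+0401,
// the first and third characters of the string quoted in the comment above.
$cyrillic = "\xD0\x8B\xD0\x81";

var_dump(containsNonAscii($cyrillic));     // bool(true)
var_dump(containsNonAscii('plain text'));  // bool(false)
```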

Copyright © 2001-2025 The PHP Group All rights reserved.	Last updated: Mon Oct 27 09:00:02 2025 UTC