PHP :: Bug #35711 :: [PATCH] ISO-8859 charset not correctly detected

Bug #35711	[PATCH] ISO-8859 charset not correctly detected
Submitted:	2005-12-16 17:18 UTC	Modified:	2005-12-25 16:26 UTC
From:	matteo at beccati dot com	Assigned:	hirokawa (profile)
Status:	Closed	Package:	mbstring related
PHP Version:	5.1CVS-2005-12-24 (snap)	OS:	Debian GNU/Linux
Private report:	No	CVE-ID:	None


[2005-12-16 17:18 UTC] matteo at beccati dot com

Description:
------------
I was evaluating the mbstring extension because of its ability to filter and convert input parameters to the correct encoding. During my tests I found that an ISO-8859-1 string which ends with an accented character is wrongly detected as UTF-8, even though as UTF-8 it ends with an incomplete multibyte character (using iconv to convert the string raises a notice to that effect).

Also reproduced with PHP 4.3.11 on FreeBSD 4 and 5.0.2 on Win32.


Reproduce code:
---------------
<?php

error_reporting(E_ALL);
mb_detect_order('ASCII,UTF-8,ISO-8859-1');

// \xE0 is ISO-8859-1 small a grave char
test_bug("Test: \xE0");
test_bug("Test: \xE0a");

function test_bug($s) {
    echo "Trying: ";
    var_dump($s);
    iconv('UTF8', 'UCS2', $s);
    echo "Detected encoding: ".mb_detect_encoding($s)."\n";
    echo "Converted string:";
    var_dump(mb_convert_encoding($s, 'UTF-8',
        'ASCII,UTF-8,ISO-8859-1'));
    echo "\n";
}

?>

Expected result:
----------------
Trying: string(7) "Test: ?"

Notice: iconv(): Detected an incomplete multibyte character in input string in test.php on line 13
Detected encoding: ISO-8859-1
Converted string:string(8) "Test: ? "

Trying: string(8) "Test: ?a"

Notice: iconv(): Detected an illegal character in input string in /var/www/mbstring/test.php on line 13
Detected encoding: ISO-8859-1
Converted string:string(9) "Test: ? a"


Actual result:
--------------
Trying: string(7) "Test: ?"

Notice: iconv(): Detected an incomplete multibyte character in input string in test.php on line 13
Detected encoding: UTF-8
Converted string:string(6) "Test: "

Trying: string(8) "Test: ?a"

Notice: iconv(): Detected an illegal character in input string in /var/www/mbstring/test.php on line 13
Detected encoding: ISO-8859-1
Converted string:string(9) "Test: ? a"


[2005-12-16 23:50 UTC] matteo at beccati dot com

I've made a patch which seems to fix the issue. It basically checks the filter status during judgement. Status seems to be != 0 only while the filter is in the middle of matching a multibyte character. I also added a fallback to the old judgement routine, in case no matching encoding is found.

Index: ext/mbstring/libmbfl/mbfl/mbfilter.c
===================================================================
RCS file: /repository/php-src/ext/mbstring/libmbfl/mbfl/mbfilter.c,v
retrieving revision 1.7.2.1
diff -u -r1.7.2.1 mbfilter.c
--- ext/mbstring/libmbfl/mbfl/mbfilter.c        5 Nov 2005 04:49:57 -0000      1.7.2.1
+++ ext/mbstring/libmbfl/mbfl/mbfilter.c        16 Dec 2005 22:46:26 -0000
@@ -575,12 +575,22 @@

        for (i = 0; i < num; i++) {
                filter = &flist[i];
-               if (!filter->flag) {
+               if (!filter->flag && !filter->status) {
                        encoding = filter->encoding;
                        break;
                }
        }

+       if (!encoding) {
+               for (i = 0; i < num; i++) {
+                       filter = &flist[i];
+                       if (!filter->flag) {
+                               encoding = filter->encoding;
+                               break;
+                       }
+               }
+       }
+
        /* cleanup */
        /* dtors should be called in reverse order */
        i = num; while (--i >= 0) {
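
For readers who do not want to trace the C, the selection rule in the patch can be modelled roughly as follows. This is only an illustrative PHP sketch: `pick_encoding()` and the array layout are hypothetical, and the meaning of `flag` (filter has rejected the input) and `status` (non-zero while a multibyte character is still open) is assumed from the patch description above.

```php
<?php
// Hypothetical PHP model of the patched judgement loop; the real code is C in mbfilter.c.
// Each candidate is ['encoding' => string, 'flag' => int, 'status' => int].
function pick_encoding(array $filters): ?string
{
    // First pass: prefer a filter that neither rejected the input
    // nor stopped in the middle of a multibyte character.
    foreach ($filters as $f) {
        if (!$f['flag'] && !$f['status']) {
            return $f['encoding'];
        }
    }
    // Fallback: the old rule, first filter that did not reject the input.
    foreach ($filters as $f) {
        if (!$f['flag']) {
            return $f['encoding'];
        }
    }
    return null; // every candidate rejected the input
}

// Assumed filter states after feeding "Test: \xE0" (values are illustrative).
$states = [
    ['encoding' => 'ASCII',      'flag' => 1, 'status' => 0],
    ['encoding' => 'UTF-8',      'flag' => 0, 'status' => 1],
    ['encoding' => 'ISO-8859-1', 'flag' => 0, 'status' => 0],
];
echo pick_encoding($states), "\n";
```

With the fallback pass, a string for which only "incomplete" candidates remain still gets the same answer as before the patch.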

[2005-12-19 09:00 UTC] matteo at beccati dot com

Oops, I just realized that I forgot the -u flag :)

Here is the downloadable patch:

http://beccati.com/download/mbstring-patch-20051219.txt

[2005-12-19 09:03 UTC] sniper@php.net

Rui, can you check this out please?

[2005-12-20 15:44 UTC] hirokawa@php.net

Please note that encoding detection is not always perfect.
In particular, when the string is too short, detection may give the wrong result.
In your case this is not a bug; it is the specified behaviour.
UTF-8 is a variable-length multibyte encoding;
a character in UTF-8 is one to six bytes long.
Please look at ext/mbstring/libmbfl/filter/mbfilter_utf8.c, around line 249.
0xe8 is a valid first byte of a 3-byte sequence.
We cannot tell whether 0xe8 is ISO-8859-1 or UTF-8,
because this byte is valid in both encodings.
In this case, the result is chosen
according to the order defined by mb_detect_order().
I suggest using a string of sufficient length
for reliable encoding detection.
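
To make the lead-byte point concrete, here is a small sketch that classifies a byte by the sequence length it announces. `utf8_expected_length()` is a hypothetical helper, not part of mbstring, and it assumes the RFC 3629 ranges, which stop at four bytes (the older UTF-8 definition allowed up to six).

```php
<?php
// Sketch only: hypothetical helper, not an mbstring function.
// Returns the total sequence length announced by a lead byte,
// or 0 if the byte cannot start a UTF-8 sequence (RFC 3629 ranges assumed).
function utf8_expected_length(int $byte): int
{
    if ($byte <= 0x7F) {
        return 1; // plain ASCII
    }
    if ($byte >= 0xC2 && $byte <= 0xDF) {
        return 2;
    }
    if ($byte >= 0xE0 && $byte <= 0xEF) {
        return 3; // 0xE0 and 0xE8 both fall here
    }
    if ($byte >= 0xF0 && $byte <= 0xF4) {
        return 4;
    }
    return 0; // continuation byte (0x80-0xBF) or a byte that is never valid
}

printf("0xE8 announces a %d-byte sequence\n", utf8_expected_length(0xE8));
printf("0xE0 announces a %d-byte sequence\n", utf8_expected_length(0xE0));
```

Both bytes are also ordinary single-byte characters in ISO-8859-1, which is why the lead byte alone cannot settle the question.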

[2005-12-20 17:10 UTC] matteo at beccati dot com

Of course, I agree that 0xe8 is valid if taken as part of a multibyte character, but I don't think it can be considered valid if the next bytes are missing (because the string ends prematurely). The iconv extension raises notices when it finds illegal or incomplete multibyte characters; I don't see why mbstring should accept as valid UTF-8 a string which in fact isn't.

The same should apply to other multibyte encodings.
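
The check being argued for can be sketched in userland. The helper below is hypothetical (the name `ends_with_incomplete_utf8()` is an assumption, and it only looks at the tail of the string, not at overall validity): it walks back over trailing continuation bytes and compares their count with what the lead byte requires.

```php
<?php
// Sketch only: hypothetical helper that inspects just the end of the string.
function ends_with_incomplete_utf8(string $s): bool
{
    $i = strlen($s) - 1;
    $cont = 0;
    // Walk back over at most three trailing continuation bytes (0x80-0xBF).
    while ($i >= 0 && $cont < 3 && (ord($s[$i]) & 0xC0) === 0x80) {
        $i--;
        $cont++;
    }
    if ($i < 0) {
        return false; // empty string, or nothing but continuation bytes
    }
    $lead = ord($s[$i]);
    if ($lead >= 0xF0 && $lead <= 0xF4) {
        $need = 3; // continuation bytes required after a 4-byte lead
    } elseif ($lead >= 0xE0 && $lead <= 0xEF) {
        $need = 2;
    } elseif ($lead >= 0xC2 && $lead <= 0xDF) {
        $need = 1;
    } else {
        return false; // ASCII, or a byte that cannot start a sequence
    }
    return $cont < $need;
}

var_dump(ends_with_incomplete_utf8("Test: \xE0"));     // lone lead byte at the end
var_dump(ends_with_incomplete_utf8("Test: \xE0a"));    // illegal, but not truncated
var_dump(ends_with_incomplete_utf8("Test: \xC3\xA0")); // complete two-byte character
```

Note that the second test string from the report is not "incomplete" in this sense; it is simply invalid UTF-8, which matches the different notice iconv raises for it.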

[2005-12-24 01:03 UTC] hirokawa@php.net

Have you tried strict mode (default: FALSE)?

string mb_detect_encoding ( string str [, mixed encoding_list [, bool strict]] )

[2005-12-24 02:23 UTC] hirokawa@php.net

This bug has been fixed in CVS.

Snapshots of the sources are packaged every three hours; this change
will be in the next snapshot. You can grab the snapshot at
http://snaps.php.net/.
 
Thank you for the report, and for helping us make PHP better.

Character-end detection has been added to strict mode (mb_detect_encoding($s, $list, TRUE)).
Please try strict mode.
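
A minimal way to try this against the string from the report, assuming a build that includes the fix (the variable names are only for illustration):

```php
<?php
// Same candidate order as the reproduce script.
$list = ['ASCII', 'UTF-8', 'ISO-8859-1'];

$truncated = "Test: \xE0";     // ends in a lone 0xE0 byte
$complete  = "Test: \xC3\xA0"; // a complete UTF-8 "a grave"

// Third argument TRUE enables strict detection.
$detectedTruncated = mb_detect_encoding($truncated, $list, true);
$detectedComplete  = mb_detect_encoding($complete, $list, true);

var_dump($detectedTruncated);
var_dump($detectedComplete);
```

In strict mode the truncated string should no longer be reported as UTF-8, while the complete one still is.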

[2005-12-24 12:30 UTC] matteo at beccati dot com

This is great news and I'm really thankful for your help. Now mb_detect_encoding works correctly when the strict flag is set, but...

- There's no way to set the strict flag in mb_convert_encoding; however one could use mb_detect_encoding with the strict flag as source charset.

- There's no way to set the strict flag for http_input translation, which indeed would be much more useful (that's how I found the problem described here).
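
The workaround from the first point can be sketched like this. `to_utf8_strict()` is a hypothetical wrapper name, and the default candidate list is simply the one from the reproduce script.

```php
<?php
// Sketch only: hypothetical wrapper around the two mbstring calls.
// Detects the source encoding in strict mode, then converts from exactly that encoding.
function to_utf8_strict(string $s, array $candidates = ['ASCII', 'UTF-8', 'ISO-8859-1'])
{
    $from = mb_detect_encoding($s, $candidates, true);
    if ($from === false) {
        return false; // no candidate encoding matched
    }
    return mb_convert_encoding($s, 'UTF-8', $from);
}

// The lone 0xE0 is treated as ISO-8859-1 and becomes the two bytes C3 A0.
var_dump(bin2hex(to_utf8_strict("Test: \xE0")));
```

This does nothing for http_input translation, which is the case the second point is about.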

[2005-12-24 13:59 UTC] matteo at beccati dot com

I've made a patch which adds an mbstring.strict_detection php.ini flag that specifies the default behaviour (defaults to off). I've only just started looking at PHP internals, so I may have made mistakes; make test passes the mbstring-related checks, and I'll do more tests later.

http://beccati.com/download/mbstring-patch-20051224.txt

[2005-12-25 16:26 UTC] hirokawa@php.net

This bug has been fixed in CVS.

Snapshots of the sources are packaged every three hours; this change
will be in the next snapshot. You can grab the snapshot at
http://snaps.php.net/.
 
Thank you for the report, and for helping us make PHP better.

mbstring.strict_detection has been introduced to enable strict-mode encoding detection.
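
A short sketch of turning the setting on for one script, assuming it can be changed at runtime with ini_set() (the usual place would be php.ini, as `mbstring.strict_detection = On`):

```php
<?php
// Assumption: the directive is changeable per script; otherwise set it in php.ini.
ini_set('mbstring.strict_detection', '1');

$s = "Test: \xE0"; // the string from the original report

// With strict detection on, the multi-encoding source list should no longer
// pick UTF-8 for a string that ends in a truncated sequence.
$converted = mb_convert_encoding($s, 'UTF-8', 'ASCII,UTF-8,ISO-8859-1');

var_dump(ini_get('mbstring.strict_detection'));
var_dump(bin2hex($converted));
```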



Copyright © 2001-2025 The PHP Group All rights reserved.	Last updated: Tue Jul 15 23:01:33 2025 UTC