Bug #35711 [PATCH] ISO-8859 charset not correctly detected
Submitted: 2005-12-16 17:18 UTC
Modified: 2005-12-25 16:26 UTC
From: matteo at beccati dot com
Assigned: hirokawa
Status: Closed
Package: mbstring related
PHP Version: 5.1CVS-2005-12-24 (snap)
OS: Debian GNU/Linux
Private report: No
CVE-ID: None
 [2005-12-16 17:18 UTC] matteo at beccati dot com
Description:
------------
I was evaluating the mbstring extension because of its ability to filter and convert input parameters to the correct encoding. During my tests I found that an ISO-8859-1 string that ends with an accented character is wrongly detected as UTF-8, even though it ends with an incomplete multibyte character (using iconv to convert the string raises a notice to that effect).

Also reproduced with PHP 4.3.11 on FreeBSD 4 and 5.0.2 on Win32.


Reproduce code:
---------------
<?php

error_reporting(E_ALL);
mb_detect_order('ASCII,UTF-8,ISO-8859-1');

// \xE0 is ISO-8859-1 small a grave char
test_bug("Test: \xE0");
test_bug("Test: \xE0a");

function test_bug($s) {
    echo "Trying: ";
    var_dump($s);
    iconv('UTF8', 'UCS2', $s);
    echo "Detected encoding: ".mb_detect_encoding($s)."\n";
    echo "Converted string:";
    var_dump(mb_convert_encoding($s, 'UTF-8',
        'ASCII,UTF-8,ISO-8859-1'));
    echo "\n";
}

?>

Expected result:
----------------
Trying: string(7) "Test: ?"

Notice: iconv(): Detected an incomplete multibyte character in input string in test.php on line 13
Detected encoding: ISO-8859-1
Converted string:string(8) "Test: ? "

Trying: string(8) "Test: ?a"

Notice: iconv(): Detected an illegal character in input string in /var/www/mbstring/test.php on line 13
Detected encoding: ISO-8859-1
Converted string:string(9) "Test: ? a"


Actual result:
--------------
Trying: string(7) "Test: ?"

Notice: iconv(): Detected an incomplete multibyte character in input string in test.php on line 13
Detected encoding: UTF-8
Converted string:string(6) "Test: "

Trying: string(8) "Test: ?a"

Notice: iconv(): Detected an illegal character in input string in /var/www/mbstring/test.php on line 13
Detected encoding: ISO-8859-1
Converted string:string(9) "Test: ? a"


 [2005-12-16 23:50 UTC] matteo at beccati dot com
I've made a patch which seems to fix the issue. It basically checks the filter status during judgement. The status seems to be != 0 only while the filter is in the middle of matching a multibyte character. I also added a fallback to the old judgement routine, in case no matching encoding is found.

Index: ext/mbstring/libmbfl/mbfl/mbfilter.c
===================================================================
RCS file: /repository/php-src/ext/mbstring/libmbfl/mbfl/mbfilter.c,v
retrieving revision 1.7.2.1
diff -u -r1.7.2.1 mbfilter.c
--- ext/mbstring/libmbfl/mbfl/mbfilter.c        5 Nov 2005 04:49:57 -0000      1.7.2.1
+++ ext/mbstring/libmbfl/mbfl/mbfilter.c        16 Dec 2005 22:46:26 -0000
@@ -575,12 +575,22 @@

        for (i = 0; i < num; i++) {
                filter = &flist[i];
-               if (!filter->flag) {
+               if (!filter->flag && !filter->status) {
                        encoding = filter->encoding;
                        break;
                }
        }

+       if (!encoding) {
+               for (i = 0; i < num; i++) {
+                       filter = &flist[i];
+                       if (!filter->flag) {
+                               encoding = filter->encoding;
+                               break;
+                       }
+               }
+       }
+
        /* cleanup */
        /* dtors should be called in reverse order */
        i = num; while (--i >= 0) {
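
The selection rule that the patch describes can be modelled outside of C. The following is only a sketch in PHP, not the libmbfl code: the function name `pick_encoding` and the array layout are hypothetical, and the reading of `status` as "a multibyte sequence is still open at end of input" is the assumption stated in the comment above, not something verified against every filter.

```php
<?php
// Hypothetical model of the patched judgement loop.
// Each filter is an array with:
//   'encoding' => candidate name
//   'flag'     => non-zero when the filter has rejected the input
//   'status'   => assumed non-zero when a multibyte character is unfinished
function pick_encoding(array $filters): ?string
{
    // First pass: accept only filters that are not in mid-character.
    foreach ($filters as $f) {
        if (!$f['flag'] && !$f['status']) {
            return $f['encoding'];
        }
    }
    // Fallback: the pre-patch rule, which ignores status.
    foreach ($filters as $f) {
        if (!$f['flag']) {
            return $f['encoding'];
        }
    }
    return null;
}
```

With the first test string from the report, the UTF-8 filter would be left in mid-character, so the first pass would skip it and fall through to ISO-8859-1.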
 [2005-12-19 09:00 UTC] matteo at beccati dot com
Oops, I just realized that I forgot the -u flag :)

Here is the downloadable patch:

http://beccati.com/download/mbstring-patch-20051219.txt
 [2005-12-19 09:03 UTC] sniper@php.net
Rui, can you check this out please?
 [2005-12-20 15:44 UTC] hirokawa@php.net
Please note that encoding detection is not always perfect.
In particular, when the string is too short, detection may give the wrong result.
In your case this is not a bug; it is the specified behaviour.
UTF-8 is a variable-length multibyte encoding;
a character in UTF-8 is from one to six bytes long.
Please look at ext/mbstring/libmbfl/filter/mbfilter_utf8.c, around line 249.
0xe8 is a valid first byte of a 3-byte sequence.
We cannot tell whether 0xe8 is ISO-8859-1 or UTF-8,
because this byte is valid in both encodings.
In this case, the result is chosen
according to the order defined by mb_detect_order().
I suggest using a string of sufficient length
for reliable encoding detection.
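
To make the lead-byte argument concrete, here is a minimal sketch in PHP. The function names are hypothetical, and the byte ranges follow RFC 3629, which limits UTF-8 to four bytes per character (the original UTF-8 definition allowed up to six). It shows that 0xE8 and 0xE0 both announce a 3-byte sequence, and how a string that stops right after such a byte can be recognised as truncated.

```php
<?php
// Length of the UTF-8 sequence announced by a lead byte under RFC 3629,
// or 0 if the byte cannot start a sequence.
function utf8_sequence_length(int $lead): int
{
    if ($lead <= 0x7F) return 1;                   // ASCII
    if ($lead >= 0xC2 && $lead <= 0xDF) return 2;
    if ($lead >= 0xE0 && $lead <= 0xEF) return 3;
    if ($lead >= 0xF0 && $lead <= 0xF4) return 4;
    return 0;                                      // continuation or illegal byte
}

// True when the string ends before its last multibyte sequence is complete.
// This only inspects the tail; it does not validate the rest of the string.
function ends_with_truncated_utf8(string $s): bool
{
    $n = strlen($s);
    $i = $n - 1;
    $back = 0;
    // Walk back over at most three trailing continuation bytes (10xxxxxx).
    while ($i >= 0 && $back < 3 && (ord($s[$i]) & 0xC0) === 0x80) {
        $i--;
        $back++;
    }
    if ($i < 0) {
        return false;
    }
    $need = utf8_sequence_length(ord($s[$i]));
    return $need > 1 && ($n - $i) < $need;
}
```

The byte alone is ambiguous, as stated above; what distinguishes the two readings is whether the announced continuation bytes actually follow.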
 [2005-12-20 17:10 UTC] matteo at beccati dot com
Of course, I agree that 0xe8 is valid if taken as part of a multibyte character, but I don't think it can be considered valid if the following bytes are missing (because the string ends prematurely). The iconv extension raises notices when it finds illegal or incomplete multibyte characters; I don't see why mbstring should accept as valid UTF-8 a string which in fact isn't.

The same should apply to other multibyte encodings.
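
For comparison, whole-string validation can be sketched with `mb_check_encoding()`, which exists in current PHP versions. The wrapper name `is_valid_utf8` is hypothetical and only there for illustration; this is not the detection code path discussed in this report.

```php
<?php
// Hypothetical wrapper: true only if every byte sequence in $s is
// complete and legal UTF-8.
function is_valid_utf8(string $s): bool
{
    return mb_check_encoding($s, 'UTF-8');
}

$truncated = "Test: \xE0";      // ends with a lone 3-byte lead byte
$complete  = "Test: \xC3\xA0";  // "à" encoded as UTF-8
```

Here `is_valid_utf8($truncated)` is false and `is_valid_utf8($complete)` is true, which is the distinction being asked for.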
 [2005-12-24 01:03 UTC] hirokawa@php.net
Have you tried the strict mode (default: FALSE)?

string mb_detect_encoding ( string str [, mixed encoding_list [, bool strict]] )

 [2005-12-24 02:23 UTC] hirokawa@php.net
This bug has been fixed in CVS.

Snapshots of the sources are packaged every three hours; this change
will be in the next snapshot. You can grab the snapshot at
http://snaps.php.net/.
 
Thank you for the report, and for helping us make PHP better.

Character-end detection has been added to the strict mode (mb_detect_encoding($s, $list, TRUE)).
Please try the strict mode.
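
A minimal sketch of the strict-mode call, using the detect order from the reproduce code. The helper name `detect_strict` is hypothetical; the third argument to `mb_detect_encoding()` is the strict flag mentioned above.

```php
<?php
// Candidate list taken from the reproduce code in this report.
$candidates = ['ASCII', 'UTF-8', 'ISO-8859-1'];

// Hypothetical helper: strict detection returns an encoding name,
// or false when no candidate accepts the whole string.
function detect_strict(string $s, array $candidates)
{
    return mb_detect_encoding($s, $candidates, true);
}

echo detect_strict("Test: \xE0", $candidates), "\n";
```

In strict mode the truncated string from the report is no longer accepted as UTF-8, so the next candidate that accepts it is reported instead.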
 [2005-12-24 12:30 UTC] matteo at beccati dot com
This is great news, and I'm really thankful for your help. mb_detect_encoding now works correctly when the strict flag is set, but...

- There's no way to set the strict flag in mb_convert_encoding; however, one could pass the result of mb_detect_encoding with the strict flag as the source charset.

- There's no way to set the strict flag for http_input translation, which would in fact be much more useful (that's how I found the problem described here).
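
The workaround from the first point can be sketched as follows. The function name `convert_to_utf8_strict` is hypothetical; it only combines the two existing calls, and it does not address the http_input case from the second point.

```php
<?php
// Hypothetical helper: detect strictly first, then convert from the
// detected encoding. Returns null when no candidate accepts the string.
function convert_to_utf8_strict(string $s, array $candidates): ?string
{
    $from = mb_detect_encoding($s, $candidates, true);
    if ($from === false) {
        return null;
    }
    return mb_convert_encoding($s, 'UTF-8', $from);
}

$utf8 = convert_to_utf8_strict("Test: \xE0", ['ASCII', 'UTF-8', 'ISO-8859-1']);
```

Returning null rather than guessing keeps the failure visible to the caller, which seems preferable when the input encoding is genuinely unknown.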
 [2005-12-24 13:59 UTC] matteo at beccati dot com
I've made a patch that adds an mbstring.strict_detection php.ini flag specifying the default behaviour (defaults to off). I have only just started looking at PHP internals, so I may have made mistakes; make test passes the mbstring-related checks, and I'll do more tests later.

http://beccati.com/download/mbstring-patch-20051224.txt
 [2005-12-25 16:26 UTC] hirokawa@php.net
This bug has been fixed in CVS.

mbstring.strict_detection has been introduced to enable strict-mode encoding detection by default.
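
Assuming the directive is set in php.ini as `mbstring.strict_detection = On`, a script can check the effective value at runtime. This is a small sketch; the helper name `strict_detection_enabled` is hypothetical.

```php
<?php
// Hypothetical helper: reports whether mbstring.strict_detection is on.
// ini_get() returns false when the directive does not exist, which
// FILTER_VALIDATE_BOOLEAN also maps to false.
function strict_detection_enabled(): bool
{
    return filter_var(ini_get('mbstring.strict_detection'), FILTER_VALIDATE_BOOLEAN);
}

echo strict_detection_enabled() ? "strict detection: on\n" : "strict detection: off\n";
```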

 