Bug #65045 mb_convert_encoding breaks well-formed character
Submitted: 2013-06-16 23:17 UTC
Modified: 2013-06-30 02:49 UTC
From: masakielastic at gmail dot com
Assigned: hirokawa
Status: Closed
Package: mbstring related
PHP Version: 5.5.0RC3
OS: Mac OSX
Private report: No
CVE-ID:
 [2013-06-16 23:17 UTC] masakielastic at gmail dot com
Description:
------------
When mb_convert_encoding is used to convert a string from UTF-8 to UTF-8 in
order to replace ill-formed byte sequences with the substitute character
(U+FFFD), it also replaces the well-formed character following an ill-formed
byte sequence with the substitute character. In addition, it deletes a trailing
ill-formed byte sequence instead of replacing it with the substitute character.

A comprehensive test case for 2- to 4-byte characters is available here:
https://gist.github.com/masakielastic/5793665 .

Test script:
---------------
// U+24B62: "\xF0\xA4\xAD\xA2"
// ill-formed: "\xF0\xA4\xAD"
// U+FFFD: "\xEF\xBF\xBD"

$str = "\xF0\xA4\xAD"."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2";
$expected = "\xEF\xBF\xBD"."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2";

$str2 = "\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD";
$expected2 = "\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2"."\xEF\xBF\xBD";

mb_substitute_character(0xFFFD);
var_dump(
    $expected === htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, 'UTF-8')),
    $expected2 === htmlspecialchars_decode(htmlspecialchars($str2, ENT_SUBSTITUTE, 'UTF-8')), 
    $expected === mb_convert_encoding($str, 'UTF-8', 'UTF-8'),
    $expected2 === mb_convert_encoding($str2, 'UTF-8', 'UTF-8')
);

Expected result:
----------------
bool(true)
bool(true)
bool(true)
bool(true)

Actual result:
--------------
bool(true)
bool(true)
bool(false)
bool(false)
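
Since the htmlspecialchars() round trip in the test script above yields the expected strings, it could serve as a stopgap on builds where mb_convert_encoding misbehaves. The following is only a minimal sketch under that assumption: scrub_utf8() is a hypothetical helper name, not part of PHP or mbstring, and the ENT_QUOTES flag is an addition here so that quotes survive the round trip unchanged.

```php
<?php
// Hypothetical helper (not a PHP built-in): replaces each ill-formed UTF-8
// byte sequence with U+FFFD by encoding with ENT_SUBSTITUTE and decoding again.
function scrub_utf8($str)
{
    // ENT_SUBSTITUTE makes htmlspecialchars() emit U+FFFD for invalid
    // sequences instead of returning an empty string.
    $encoded = htmlspecialchars($str, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

    // Undo the entity encoding so that only the substitution remains.
    return htmlspecialchars_decode($encoded, ENT_QUOTES);
}

// Same inputs as in the test script above.
$str       = "\xF0\xA4\xAD" . "\xF0\xA4\xAD\xA2" . "\xF0\xA4\xAD\xA2";
$expected  = "\xEF\xBF\xBD" . "\xF0\xA4\xAD\xA2" . "\xF0\xA4\xAD\xA2";
$str2      = "\xF0\xA4\xAD\xA2" . "\xF0\xA4\xAD\xA2" . "\xF0\xA4\xAD";
$expected2 = "\xF0\xA4\xAD\xA2" . "\xF0\xA4\xAD\xA2" . "\xEF\xBF\xBD";

var_dump(
    $expected === scrub_utf8($str),
    $expected2 === scrub_utf8($str2)
);
```

This does not depend on libmbfl at all, so it sidesteps the conversion path reported here; it is not a substitute for fixing mb_convert_encoding itself.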

History
 [2013-06-17 12:30 UTC] ab@php.net
-Status: Open
+Status: Verified
 [2013-06-17 12:30 UTC] ab@php.net
I can reproduce this on Windows too, so the issue is probably not specific to
OS X. Here is a slightly modified snippet:

<?php

$str1 = "\xF0\xA4\xAD" . "\xF0\xA4\xAD\xA2" . "\xF0\xA4\xAD\xA2";
$exp1 = "\xEF\xBF\xBD" . "\xF0\xA4\xAD\xA2" . "\xF0\xA4\xAD\xA2";

if (true !== mb_substitute_character(0xFFFD)) {
        die("can't set substitute char\n");
}

print_hex($str1);
$s = mb_convert_encoding($str1, 'UTF-8', mb_detect_encoding($str1));
print_hex($s);

function print_hex($s)
{
        for ($i = 0; $i < strlen($s); $i++) {
                echo "0x", dechex(ord($s[$i])), " ";
        }
        echo "\n";
}

?>

And the output (pipes added manually as UTF-8 character separators):

0xf0 0xa4 0xad | 0xf0 0xa4 0xad 0xa2 | 0xf0 0xa4 0xad 0xa2

0xef 0xbf 0xbd | 0xef 0xbf 0xbd | 0xef 0xbf 0xbd | 0xef 0xbf 0xbd | 0xf0 0xa4 0xad 0xa2

As one can see, the original invalid 3-byte sequence and the following valid
4-byte sequence are both replaced with "0xef 0xbf 0xbd"; only the last character
survives. However, looking at the code, only libmbfl is involved there:
http://lxr.php.net/xref/PHP_5_5/ext/mbstring/mbstring.c#3011 . I am not sure yet
whether I have overlooked something, so I have to write a C snippet to check.
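
The pipe separators in the output above were added by hand. As a rough sketch, the grouping could be automated with a byte-level regular expression for well-formed UTF-8. The helper name hex_units() is hypothetical, and treating a non-ASCII byte plus any continuation bytes after it as a single ill-formed run is an assumption made only for display purposes:

```php
<?php
// Hypothetical helper (not part of mbstring): returns the bytes of $s in hex,
// with " | " between well-formed UTF-8 characters and runs of ill-formed bytes.
function hex_units($s)
{
    // Byte-level pattern (no /u modifier) for one well-formed UTF-8 character,
    // following the well-formed byte sequence ranges of the Unicode Standard.
    $wellFormed = '[\x00-\x7F]'
        . '|[\xC2-\xDF][\x80-\xBF]'
        . '|\xE0[\xA0-\xBF][\x80-\xBF]'
        . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'
        . '|\xED[\x80-\x9F][\x80-\xBF]'
        . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'
        . '|[\xF1-\xF3][\x80-\xBF]{3}'
        . '|\xF4[\x80-\x8F][\x80-\xBF]{2}';

    // Fallback alternative: any other non-ASCII byte, together with the
    // continuation bytes that follow it, is shown as one ill-formed run.
    preg_match_all('/' . $wellFormed . '|[\x80-\xFF][\x80-\xBF]*/', $s, $m);

    $units = array();
    foreach ($m[0] as $unit) {
        $bytes = array();
        foreach (str_split($unit) as $byte) {
            $bytes[] = sprintf('0x%02x', ord($byte));
        }
        $units[] = implode(' ', $bytes);
    }
    return implode(' | ', $units);
}

echo hex_units("\xF0\xA4\xAD" . "\xF0\xA4\xAD\xA2" . "\xF0\xA4\xAD\xA2"), "\n";
// prints: 0xf0 0xa4 0xad | 0xf0 0xa4 0xad 0xa2 | 0xf0 0xa4 0xad 0xa2
```

Running the same helper on the return value of mb_convert_encoding makes it easier to see which units were substituted.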
 [2013-06-30 02:49 UTC] hirokawa@php.net
-Status: Verified
+Status: Feedback
-Assigned To:
+Assigned To: hirokawa
 [2013-06-30 02:49 UTC] hirokawa@php.net
This problem is caused by a bug in libmbfl's handling of ill-formed UTF-8.
libmbfl is maintained at https://github.com/moriyoshi/libmbfl.
Please try the newest version of libmbfl from GitHub.
 [2013-06-30 06:32 UTC] hirokawa@php.net
Automatic comment on behalf of hirokawa
Revision: http://git.php.net/?p=php-src.git;a=commit;h=c6a7549efcca62346687b0fda5b408b963f5ab2d
Log: fixed #65045: mb_convert_encoding breaks well-formed character.
 [2013-06-30 06:32 UTC] hirokawa@php.net
-Status: Feedback
+Status: Closed
 [2013-07-30 23:18 UTC] hirokawa@php.net
Automatic comment on behalf of hirokawa
Revision: http://git.php.net/?p=php-src.git;a=commit;h=c10d7e1afc63f0a0eaadb115560cc3ca626eb245
Log: MFH: fixed #65045: mb_convert_encoding breaks well-formed character.
 [2013-07-30 23:46 UTC] hirokawa@php.net
Automatic comment on behalf of hirokawa
Revision: http://git.php.net/?p=php-src.git;a=commit;h=0a974f14d13832838dcc7bae88b3271b7d035f46
Log: MFH: fixed #65045: mb_convert_encoding breaks well-formed character.
 [2013-07-31 10:25 UTC] dmitry@php.net
Automatic comment on behalf of hirokawa
Revision: http://git.php.net/?p=php-src.git;a=commit;h=c10d7e1afc63f0a0eaadb115560cc3ca626eb245
Log: MFH: fixed #65045: mb_convert_encoding breaks well-formed character.
 [2013-07-31 12:35 UTC] dmitry@php.net
Automatic comment on behalf of hirokawa
Revision: http://git.php.net/?p=php-src.git;a=commit;h=0a974f14d13832838dcc7bae88b3271b7d035f46
Log: MFH: fixed #65045: mb_convert_encoding breaks well-formed character.
 [2013-11-17 09:30 UTC] laruence@php.net
Automatic comment on behalf of hirokawa
Revision: http://git.php.net/?p=php-src.git;a=commit;h=c10d7e1afc63f0a0eaadb115560cc3ca626eb245
Log: MFH: fixed #65045: mb_convert_encoding breaks well-formed character.
 [2013-11-17 09:30 UTC] laruence@php.net
Automatic comment on behalf of hirokawa
Revision: http://git.php.net/?p=php-src.git;a=commit;h=c6a7549efcca62346687b0fda5b408b963f5ab2d
Log: fixed #65045: mb_convert_encoding breaks well-formed character.
 