Bug #65045 mb_convert_encoding breaks well-formed character
Submitted: 2013-06-16 23:17 UTC
Modified: 2013-06-30 02:49 UTC
From: masakielastic at gmail dot com
Assigned: hirokawa
Status: Closed
Package: mbstring related
PHP Version: 5.5.0RC3
OS: Mac OSX
Private report: No
CVE-ID: None
 [2013-06-16 23:17 UTC] masakielastic at gmail dot com
Description:
------------
When converting a string from UTF-8 to UTF-8 with mb_convert_encoding in order to
replace ill-formed byte sequences with the substitute character (U+FFFD),
mb_convert_encoding also replaces the well-formed character following an ill-formed
byte sequence with the substitute character. In addition, mb_convert_encoding deletes
a trailing ill-formed byte sequence instead of replacing it with the substitute character.

A comprehensive test case for 2- to 4-byte
characters is here: https://gist.github.com/masakielastic/5793665 .

Test script:
---------------
// U+24B62: "\xF0\xA4\xAD\xA2"
// ill-formed: "\xF0\xA4\xAD"
// U+FFFD: "\xEF\xBF\xBD"

$str = "\xF0\xA4\xAD".  "\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2";
$expected = "\xEF\xBF\xBD"."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2";

$str2 = "\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD";
$expected2 = "\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2"."\xEF\xBF\xBD";

mb_substitute_character(0xFFFD);
var_dump(
    $expected === htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, 'UTF-8')),
    $expected2 === htmlspecialchars_decode(htmlspecialchars($str2, ENT_SUBSTITUTE, 'UTF-8')), 
    $expected === mb_convert_encoding($str, 'UTF-8', 'UTF-8'),
    $expected2 === mb_convert_encoding($str2, 'UTF-8', 'UTF-8')
);

Expected result:
----------------
bool(true)
bool(true)
bool(true)
bool(true)

Actual result:
--------------
bool(true)
bool(true)
bool(false)
bool(false)

 [2013-06-17 12:30 UTC] ab@php.net
-Status: Open +Status: Verified
 [2013-06-17 12:30 UTC] ab@php.net
I can reproduce this on Windows too, so the issue is probably not specific to OS X. Here's
a slightly modified snippet:

<?php

$str1 = "\xF0\xA4\xAD" . "\xF0\xA4\xAD\xA2" . "\xF0\xA4\xAD\xA2";
$exp1 = "\xEF\xBF\xBD" . "\xF0\xA4\xAD\xA2" . "\xF0\xA4\xAD\xA2";

if (true !== mb_substitute_character(0xFFFD)) {
        die("can't set substitute char\n");
}

print_hex($str1);
$s = mb_convert_encoding($str1, 'UTF-8', mb_detect_encoding($str1));
print_hex($s);

function print_hex($s)
{
        for ($i = 0; $i < strlen($s); $i++) {
                echo "0x", dechex(ord($s[$i])), " ";
        }
        echo "\n";
}

?>

And the output (pipes added manually as UTF-8 character separators):

0xf0 0xa4 0xad | 0xf0 0xa4 0xad 0xa2 | 0xf0 0xa4 0xad 0xa2

0xef 0xbf 0xbd | 0xef 0xbf 0xbd | 0xef 0xbf 0xbd | 0xef 0xbf 0xbd | 0xf0 0xa4 0xad 0xa2

As one can see, the original invalid 3-byte sequence and the valid 4-byte sequence
that follows it are both replaced with "0xef 0xbf 0xbd"; only the last character remains intact.
However, looking at the code, only libmbfl is involved
there: http://lxr.php.net/xref/PHP_5_5/ext/mbstring/mbstring.c#3011 . I am not sure yet
whether I have overlooked something, so I will have to write a C snippet.
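
The pipe separators in the output above were added by hand. A rough sketch of doing the same grouping automatically follows; hex_groups is a hypothetical helper name, and it simply starts a new group at every byte that is not a UTF-8 continuation byte (0x80 to 0xBF), so it shows where sequences begin without validating them.

```php
<?php
// Hypothetical helper: print bytes as hex, starting a new group at every
// byte that is not a UTF-8 continuation byte (10xxxxxx).
function hex_groups($s)
{
    $groups = array();
    for ($i = 0, $n = strlen($s); $i < $n; $i++) {
        $byte = ord($s[$i]);
        $isContinuation = ($byte & 0xC0) === 0x80;
        // A lead or ASCII byte opens a new group; so does a stray
        // continuation byte at the very start of the string.
        if (!$isContinuation || count($groups) === 0) {
            $groups[] = array();
        }
        $groups[count($groups) - 1][] = sprintf('0x%02x', $byte);
    }
    $joined = array();
    foreach ($groups as $g) {
        $joined[] = implode(' ', $g);
    }
    return implode(' | ', $joined);
}

// Same input as $str1 in the snippet above.
echo hex_groups("\xF0\xA4\xAD" . "\xF0\xA4\xAD\xA2" . "\xF0\xA4\xAD\xA2"), "\n";
```

For the input string this reproduces the first output line, with the truncated 3-byte sequence in its own group.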
 [2013-06-30 02:49 UTC] hirokawa@php.net
-Status: Verified +Status: Feedback -Assigned To: +Assigned To: hirokawa
 [2013-06-30 02:49 UTC] hirokawa@php.net
This problem is caused by an issue in libmbfl's handling of ill-formed UTF-8.
libmbfl is maintained at https://github.com/moriyoshi/libmbfl.
Please try the newest version of libmbfl on GitHub.
 [2013-06-30 06:32 UTC] hirokawa@php.net
Automatic comment on behalf of hirokawa
Revision: http://git.php.net/?p=php-src.git;a=commit;h=c6a7549efcca62346687b0fda5b408b963f5ab2d
Log: fixed #65045: mb_convert_encoding breaks well-formed character.
 [2013-06-30 06:32 UTC] hirokawa@php.net
-Status: Feedback +Status: Closed
 [2013-07-30 23:18 UTC] hirokawa@php.net
Automatic comment on behalf of hirokawa
Revision: http://git.php.net/?p=php-src.git;a=commit;h=c10d7e1afc63f0a0eaadb115560cc3ca626eb245
Log: MFH: fixed #65045: mb_convert_encoding breaks well-formed character.
 [2013-07-30 23:46 UTC] hirokawa@php.net
Automatic comment on behalf of hirokawa
Revision: http://git.php.net/?p=php-src.git;a=commit;h=0a974f14d13832838dcc7bae88b3271b7d035f46
Log: MFH: fixed #65045: mb_convert_encoding breaks well-formed character.
 [2013-07-31 10:25 UTC] dmitry@php.net
Automatic comment on behalf of hirokawa
Revision: http://git.php.net/?p=php-src.git;a=commit;h=c10d7e1afc63f0a0eaadb115560cc3ca626eb245
Log: MFH: fixed #65045: mb_convert_encoding breaks well-formed character.
 [2013-07-31 12:35 UTC] dmitry@php.net
Automatic comment on behalf of hirokawa
Revision: http://git.php.net/?p=php-src.git;a=commit;h=0a974f14d13832838dcc7bae88b3271b7d035f46
Log: MFH: fixed #65045: mb_convert_encoding breaks well-formed character.
 [2013-11-17 09:30 UTC] laruence@php.net
Automatic comment on behalf of hirokawa
Revision: http://git.php.net/?p=php-src.git;a=commit;h=c10d7e1afc63f0a0eaadb115560cc3ca626eb245
Log: MFH: fixed #65045: mb_convert_encoding breaks well-formed character.
 [2013-11-17 09:30 UTC] laruence@php.net
Automatic comment on behalf of hirokawa
Revision: http://git.php.net/?p=php-src.git;a=commit;h=c6a7549efcca62346687b0fda5b408b963f5ab2d
Log: fixed #65045: mb_convert_encoding breaks well-formed character.
 [2014-10-07 23:17 UTC] stas@php.net
Automatic comment on behalf of hirokawa
Revision: http://git.php.net/?p=php-src-security.git;a=commit;h=0a974f14d13832838dcc7bae88b3271b7d035f46
Log: MFH: fixed #65045: mb_convert_encoding breaks well-formed character.
 [2014-10-07 23:28 UTC] stas@php.net
Automatic comment on behalf of hirokawa
Revision: http://git.php.net/?p=php-src-security.git;a=commit;h=0a974f14d13832838dcc7bae88b3271b7d035f46
Log: MFH: fixed #65045: mb_convert_encoding breaks well-formed character.
 