[2013-06-16 23:17 UTC] masakielastic at gmail dot com
Description:
------------
When converting a string from UTF-8 to UTF-8 with mb_convert_encoding in order to replace ill-formed byte sequences with the substitute character (U+FFFD), mb_convert_encoding also replaces the character following the ill-formed byte sequence with the substitute character. In addition, mb_convert_encoding deletes a trailing ill-formed byte sequence instead of replacing it with the substitute character.

A comprehensive test case for 2- to 4-byte characters is here: https://gist.github.com/masakielastic/5793665

Test script:
---------------
// U+24B62: "\xF0\xA4\xAD\xA2"
// ill-formed: "\xF0\xA4\xAD"
// U+FFFD: "\xEF\xBF\xBD"

$str = "\xF0\xA4\xAD"."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2";
$expected = "\xEF\xBF\xBD"."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2";

$str2 = "\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD";
$expected2 = "\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2"."\xEF\xBF\xBD";

mb_substitute_character(0xFFFD);

var_dump(
    $expected === htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, 'UTF-8')),
    $expected2 === htmlspecialchars_decode(htmlspecialchars($str2, ENT_SUBSTITUTE, 'UTF-8')),
    $expected === mb_convert_encoding($str, 'UTF-8', 'UTF-8'),
    $expected2 === mb_convert_encoding($str2, 'UTF-8', 'UTF-8')
);

Expected result:
----------------
bool(true)
bool(true)
bool(true)
bool(true)

Actual result:
--------------
bool(true)
bool(true)
bool(false)
bool(false)
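Since the htmlspecialchars round trip in the test script already yields the expected result, it can be wrapped in a small helper as a possible workaround. This is only a sketch: the name scrub_utf8 is hypothetical (it is not part of PHP or mbstring), and the explicit ENT_QUOTES flag is an addition so that encoding and decoding use matching flags.

```php
<?php
// Hypothetical helper, not a PHP or mbstring function: replaces each
// ill-formed UTF-8 byte sequence in $s with U+FFFD ("\xEF\xBF\xBD").
function scrub_utf8($s)
{
    // ENT_SUBSTITUTE makes htmlspecialchars emit U+FFFD for an invalid
    // sequence instead of returning an empty string.
    $escaped = htmlspecialchars($s, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

    // Undo the entity escaping with the same quote flag, so that only the
    // U+FFFD substitutions remain as a difference from the input.
    return htmlspecialchars_decode($escaped, ENT_QUOTES);
}
```

Usage would be along the lines of `$clean = scrub_utf8($str);` in place of `mb_convert_encoding($str, 'UTF-8', 'UTF-8')`.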
I can reproduce this on Windows too, so the issue is probably not limited to OS X. Here is a slightly modified snippet:

<?php
$str1 = "\xF0\xA4\xAD" . "\xF0\xA4\xAD\xA2" . "\xF0\xA4\xAD\xA2";
$exp1 = "\xEF\xBF\xBD" . "\xF0\xA4\xAD\xA2" . "\xF0\xA4\xAD\xA2";

if (true !== mb_substitute_character(0xFFFD)) {
    die("can't set substitute char\n");
}

print_hex($str1);
$s = mb_convert_encoding($str1, 'UTF-8', mb_detect_encoding($str1));
print_hex($s);

// Print every byte of $s as a hex value, followed by a newline.
function print_hex($s)
{
    for ($i = 0; $i < strlen($s); $i++) {
        echo "0x", dechex(ord($s[$i])), " ";
    }
    echo "\n";
}
?>

And the output (pipes added manually as UTF-8 character separators):

0xf0 0xa4 0xad | 0xf0 0xa4 0xad 0xa2 | 0xf0 0xa4 0xad 0xa2
0xef 0xbf 0xbd | 0xef 0xbf 0xbd | 0xef 0xbf 0xbd | 0xef 0xbf 0xbd | 0xf0 0xa4 0xad 0xa2

As one can see, the original invalid 3-byte sequence at the start and the valid 4-byte sequence after it are replaced with "0xef 0xbf 0xbd"; only the last sequence remains.

However, looking at the code, only libmbfl is involved there: http://lxr.php.net/xref/PHP_5_5/ext/mbstring/mbstring.c#3011 . I am not sure yet whether I have overlooked something; I will have to write a C snippet.
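The pipe separators in the output above were added by hand. A minimal sketch of a helper that inserts them automatically is shown below; the name hex_groups is hypothetical, and the grouping is only a display heuristic (a lead byte plus the continuation bytes that follow it), not a UTF-8 validator.

```php
<?php
// Hypothetical helper: formats the bytes of $s as hex values and puts " | "
// between groups, so that truncated sequences stand out in the dump.
function hex_groups($s)
{
    // A group is one of:
    //   - a lead byte (0xC0-0xFF) with up to three continuation bytes,
    //   - a single ASCII byte,
    //   - a single stray continuation byte (0x80-0xBF).
    preg_match_all('/[\xC0-\xFF][\x80-\xBF]{0,3}|[\x00-\x7F]|[\x80-\xBF]/', $s, $m);

    $groups = array();
    foreach ($m[0] as $chunk) {
        $bytes = array();
        foreach (str_split($chunk) as $byte) {
            $bytes[] = sprintf('0x%02x', ord($byte));
        }
        $groups[] = implode(' ', $bytes);
    }

    return implode(' | ', $groups);
}
```

Calling `echo hex_groups($str1), "\n";` and `echo hex_groups($s), "\n";` in the snippet above would then take the place of the two print_hex calls.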