Bug #75131 Iconv output buffering handler doesn't encode correctly across chunk boundaries
Submitted: 2017-08-28 13:52 UTC Modified: -
Votes:2
Avg. Score:4.5 ± 0.5
Reproduced:2 of 2 (100.0%)
Same Version:0 (0.0%)
Same OS:1 (50.0%)
From: jocrutrisi at ibsats dot com Assigned:
Status: Open Package: ICONV related
PHP Version: 7.2.0beta3 OS: Linux
Private report: No CVE-ID:
 [2017-08-28 13:52 UTC] jocrutrisi at ibsats dot com
Description:
------------
"ob_iconv_handler" is the only means in PHP to convert streaming input from one charset to another.

We can't use any of the other functions: when input is processed in chunks of arbitrary or fixed size, variable-width encodings like UTF-8 often end up with partial characters at the start or end of a chunk. If we convert every chunk as if it were a complete string, we get corrupted output, because the converter never sees those characters whole.
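
To illustrate the boundary problem in isolation, here is a minimal sketch, assuming the iconv extension is loaded and the file is saved as UTF-8. The 3-byte chunk size and the variable names are arbitrary choices for the illustration and are not part of the report.

```php
<?php
// 'Здравей!' in UTF-8: seven 2-byte Cyrillic letters plus '!' = 15 bytes.
$text = 'Здравей!';

// str_split() works on bytes, so 3-byte chunks cut through characters.
$chunks = str_split($text, 3);

// Converting the complete string succeeds.
$wholeOk = @iconv('UTF-8', 'UTF-16LE', $text) !== false;

// Converting each chunk as if it were a complete string does not:
// iconv() typically returns false for a chunk that starts or ends
// in the middle of a character (errors are silenced with @ here).
$failedChunks = 0;
foreach ($chunks as $chunk) {
    if (@iconv('UTF-8', 'UTF-16LE', $chunk) === false) {
        $failedChunks++;
    }
}

printf(
    "whole string ok: %s, failed chunks: %d of %d\n",
    $wholeOk ? 'yes' : 'no',
    $failedChunks,
    count($chunks)
);
```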

Since "ob_iconv_handler" is advertised as a stream output handler, you'd expect it to handle this scenario correctly. Alas, it doesn't.

Output is correctly encoded when we give "complete" chunks to the handler. But if we cut down the buffer size so partial characters are sent in each chunk... we get garbled output.

This is especially troubling, not only because the output is incorrect, but because this is supposedly the ONLY WAY to convert a stream from one charset to another. The only other option in PHP right now is to load the entire string into memory and convert it in one call. If it doesn't fit in memory, we're S.O.L.

There are many other issues with ob_iconv_handler: it relies on global settings, it has global state, and it is not reentrant. It'd be amazing if we had some sort of iconv_open / iconv_read / iconv_write / iconv_close API to handle these cases, but I digress...
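
As a rough idea of what an open/write/close style converter could look like in userland, here is a minimal sketch for the UTF-16LE to UTF-8 case only. The class name Utf16leToUtf8Stream and its write()/close() methods are hypothetical and do not exist in PHP or in the iconv extension; the sketch only assumes that iconv() itself is available. It keeps its state in the object rather than in global settings, and holds back bytes that do not yet form a complete character.

```php
<?php
// Hypothetical userland helper; not an existing PHP or iconv API.
final class Utf16leToUtf8Stream
{
    /** @var string Bytes received but not yet convertible. */
    private $pending = '';

    /** Feed a chunk of any size; returns UTF-8 for the characters completed so far. */
    public function write(string $bytes): string
    {
        $this->pending .= $bytes;

        // Only whole 16-bit code units can be converted.
        $usable = strlen($this->pending) & ~1;

        // If the last whole unit is a high surrogate (0xD800-0xDBFF),
        // hold it back until its low surrogate arrives.
        if ($usable >= 2) {
            // Little-endian: the second byte of a unit is its high byte.
            $highByte = ord($this->pending[$usable - 1]);
            if ($highByte >= 0xD8 && $highByte <= 0xDB) {
                $usable -= 2;
            }
        }
        if ($usable === 0) {
            return '';
        }

        $out = iconv('UTF-16LE', 'UTF-8', substr($this->pending, 0, $usable));
        if ($out === false) {
            throw new RuntimeException('Invalid UTF-16LE input');
        }
        // The cast covers PHP versions where substr() returns false at end of string.
        $this->pending = (string) substr($this->pending, $usable);
        return $out;
    }

    /** Call at end of input; returns the number of dangling bytes (0 if complete). */
    public function close(): int
    {
        $left = strlen($this->pending);
        $this->pending = '';
        return $left;
    }
}

// Usage: feed the sample text from this report one byte at a time.
$utf16 = iconv('UTF-8', 'UTF-16LE', 'Здравей!');
$stream = new Utf16leToUtf8Stream();
$result = '';
for ($i = 0; $i < strlen($utf16); $i++) {
    $result .= $stream->write($utf16[$i]);
}
$dangling = $stream->close();
echo $result, "\n";
```

A general-purpose version would need the same kind of boundary logic for every source encoding, which is why support inside the extension would be preferable to a per-encoding helper like this one.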

The examples below reproduce the problem.

Test script:
---------------
--------------------------------------------------------------------------------
EXAMPLE 1:
--------------------------------------------------------------------------------

// Make sure display is right for browsers (also works in CLI if UTF8 is supported).
header('Content-Type: text/plain; charset=utf-8');

// UTF-8 sample text.
$t = 'Здравей!';
// We convert it to UTF-16LE, to then convert it back to UTF-8
$t = iconv('UTF-8', 'UTF-16LE', $t);

// We set-up ob_iconv_handler() to do UTF-16LE -> UTF-8 conversion
ini_set('internal_encoding', 'UTF-16LE');
ini_set('output_encoding', 'UTF-8');

// Prints "Здравей!" as expected.
ob_start('ob_iconv_handler', 4096);
for ($i = 0; $i<strlen($t); $i++) {
   echo $t[$i];
}
ob_end_flush();


--------------------------------------------------------------------------------
EXAMPLE 2:
--------------------------------------------------------------------------------

// Make sure display is right for browsers (also works in CLI if UTF8 is supported).
header('Content-Type: text/plain; charset=utf-8');

// UTF-8 sample text.
$t = 'Здравей!';
// We convert it to UTF-16LE, to then convert it back to UTF-8
$t = iconv('UTF-8', 'UTF-16LE', $t);

// We set-up ob_iconv_handler() to do UTF-16LE -> UTF-8 conversion
ini_set('internal_encoding', 'UTF-16LE');
ini_set('output_encoding', 'UTF-8');

// Prints GARBLED OUTPUT.
ob_start('ob_iconv_handler', 1);
for ($i = 0; $i<strlen($t); $i++) {
   echo $t[$i];
}
ob_end_flush();

Expected result:
----------------
Same correct output in both samples.

Actual result:
--------------
Garbled output in the second example.

 [2017-08-28 14:49 UTC] tyzoid dot d at gmail dot com
Potential workaround using mb_strcut

---------
$mb_str="        0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F
U+1F60x 😀 😁 😂 😃 😄 😅 😆 😇 😈 😉 😊 😋 😌 😍 😎 😏
U+1F61x 😐 😑 😒 😓 😔 😕 😖 😗 😘 😙 😚 😛 😜 😝 😞 😟
U+1F62x 😠 😡 😢 😣 😤 😥 😦 😧 😨 😩 😪 😫 😬 😭 😮 😯
U+1F63x 😰 😱 😲 😳 😴 😵 😶 😷 😸 😹 😺 😻 😼 😽 😾 😿
U+1F64x 🙀 🙁 🙂 🙃 🙄 🙅 🙆 🙇 🙈 🙉 🙊 🙋 🙌 🙍 🙎 🙏
U+1F68x 🚀 🚁 🚂 🚃 🚄 🚅 🚆 🚇 🚈 🚉 🚊 🚋 🚌 🚍 🚎 🚏
U+1F69x 🚐 🚑 🚒 🚓 🚔 🚕 🚖 🚗 🚘 🚙 🚚 🚛 🚜 🚝 🚞 🚟
U+1F6Ax 🚠 🚡 🚢 🚣 🚤\n";

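// Write the sample as UTF-16LE; the string literal above is UTF-8.
file_put_contents("emoji-16.txt", mb_convert_encoding($mb_str, 'UTF-16LE', 'UTF-8'));

$fh = fopen("emoji-16.txt", "rb");

$input_format = 'UTF-16LE';
$databuf = "";
// Read one byte at a time to force chunk boundaries inside characters.
while (!feof($fh))
{
	$databuf .= fread($fh, 1);
	// Keep only the leading part of the buffer that mb_strcut returns as whole characters
	$str = mb_strcut($databuf, 0, null, $input_format);
	// Carry any trailing partial bytes over to the next iteration
	$databuf = (string) substr($databuf, strlen($str));
	echo mb_convert_encoding($str, "UTF-8", $input_format);
}
fclose($fh);

// Flush whatever is still buffered at end of input
if ($databuf !== "")
{
	$str = mb_strcut($databuf, 0, null, $input_format);
	echo mb_convert_encoding($str, "UTF-8", $input_format);
}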
----------
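
Another avenue that may be worth trying is PHP's convert.iconv.* stream filter, which is a different technique from the mb_strcut workaround above. The sketch below assumes the iconv extension registers that filter in your build; the temporary file and the 3-byte read size are arbitrary choices for the illustration, and this has not been checked against every iconv implementation.

```php
<?php
// Put the report's sample text into a temporary file as UTF-16LE.
$original = 'Здравей!';
$fh = tmpfile();
fwrite($fh, iconv('UTF-8', 'UTF-16LE', $original));
rewind($fh);

// Attach an iconv conversion filter to the read side of the stream.
stream_filter_append($fh, 'convert.iconv.UTF-16LE/UTF-8', STREAM_FILTER_READ);

// Read in small pieces; the calling code never converts a chunk itself.
$converted = '';
while (!feof($fh)) {
    $converted .= fread($fh, 3);
}
fclose($fh);

echo $converted, "\n";
```

Because the conversion happens inside the stream layer, the whole input does not have to be held in one string, which is the constraint described in the report.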
 