PHP :: Bug #75131 :: Iconv output buffering handler doesn't encode correctly across chunk boundaries

Submitted:	2017-08-28 13:52 UTC
Modified:	2018-08-26 16:51 UTC
Votes:	3
Avg. Score:	4.7 ± 0.5
Reproduced:	3 of 3 (100.0%)
Same Version:	1 (33.3%)
Same OS:	2 (66.7%)
From:	jocrutrisi at ibsats dot com
Assigned:
Status:	Verified
Package:	ICONV related
PHP Version:	7.2.0beta3
OS:	Linux
Private report:
CVE-ID:	None

[2017-08-28 13:52 UTC] jocrutrisi at ibsats dot com

Description:
------------
The "ob_iconv_handler" is the only means in PHP to convert streaming input from one charset to another charset.

We can't use any of the other functions, because as we process in chunks of undefined or fixed size, variable-width encodings like UTF-8 often end up with partial characters at the boundaries (start/end) of the chunk. So if we convert every chunk as if it's a complete string, we'll end up with corrupted output as the conversion can't see the entire characters.
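To illustrate the boundary problem described above, here is a minimal sketch that converts the report's sample text once as a whole and once in independent 3-byte chunks. The chunk size of 3 is an arbitrary choice for the demonstration; any size that does not line up with character boundaries should show the same effect.

```php
<?php
// Sample text from the report: 7 Cyrillic letters (2 bytes each in UTF-8) plus "!".
$utf8 = 'Здравей!';

// Converting the whole string in one call works.
$whole = iconv('UTF-8', 'UTF-16LE', $utf8);

// Converting fixed 3-byte chunks independently does not: most chunks begin or
// end in the middle of a 2-byte character, so iconv() rejects them.
$pieces = '';
foreach (str_split($utf8, 3) as $chunk) {
    // @ hides the diagnostic iconv() raises for a bad chunk; it returns false in that case.
    $converted = @iconv('UTF-8', 'UTF-16LE', $chunk);
    $pieces .= ($converted === false) ? '' : $converted;
}

// The chunked result does not match the whole-string conversion.
var_dump($pieces === $whole);
```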

Since "ob_iconv_handler" is advertised as a stream output handler, you'd think it handles this scenario correctly. Alas, it doesn't.

Output is correctly encoded when we give "complete" chunks to the handler. But if we cut down the buffer size so partial characters are sent in each chunk... we get garbled output.

This is especially troubling, not only because it's incorrect, but because this is supposedly the ONLY WAY to convert a stream from one charset to another. The only other option in PHP right now is to put the entire string in memory and convert it in one go. If it doesn't fit in memory, we're S.O.L.

There are many other issues with ob_iconv_handler - it relies on global settings, it has global state and is not reentrant... It'd be amazing if we had some sort of iconv_open iconv_read/write iconv_close API to handle these cases, but I digress...
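As a rough sketch of what such a stateful, reentrant converter could look like in userland: the class name ChunkedIconv and its write()/close() methods are hypothetical and not part of PHP. The sketch assumes an incomplete character is never longer than 3 bytes, which holds for UTF-8, UTF-16 and UTF-32 but not necessarily for other encodings.

```php
<?php
// Hypothetical helper, not part of PHP: carries incomplete trailing bytes between calls.
final class ChunkedIconv
{
    private $from;
    private $to;
    private $pending = '';

    public function __construct(string $from, string $to)
    {
        $this->from = $from;
        $this->to = $to;
    }

    // Convert as much of the buffered input as possible; keep the rest for later.
    public function write(string $bytes): string
    {
        $buf = $this->pending . $bytes;
        $len = strlen($buf);
        // Assumption: an incomplete character is at most 3 bytes long.
        for ($hold = 0; $hold <= 3 && $hold <= $len; $hold++) {
            $head = (string) substr($buf, 0, $len - $hold);
            $out = ($head === '') ? '' : @iconv($this->from, $this->to, $head);
            if ($out !== false) {
                // Remember the bytes that were held back for the next call.
                $this->pending = (string) substr($buf, $len - $hold);
                return $out;
            }
        }
        throw new RuntimeException('Input is not valid ' . $this->from);
    }

    // Call once at end of input; leftover bytes mean the input was truncated.
    public function close(): void
    {
        if ($this->pending !== '') {
            throw new RuntimeException('Truncated character at end of input');
        }
    }
}

// Usage: feed the UTF-16LE sample one byte at a time.
$utf8 = 'Здравей!';
$utf16 = iconv('UTF-8', 'UTF-16LE', $utf8);
$conv = new ChunkedIconv('UTF-16LE', 'UTF-8');
$result = '';
foreach (str_split($utf16, 1) as $byte) {
    $result .= $conv->write($byte);
}
$conv->close();
var_dump($result === $utf8);
```

Each instance keeps its own pending bytes, so several conversions can run side by side without touching global ini settings.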

Find the examples below reproducing the problem.

Test script:
---------------
--------------------------------------------------------------------------------
EXAMPLE 1:
--------------------------------------------------------------------------------

// Make sure display is right for browsers (also works in CLI if UTF8 is supported).
header('Content-Type: text/plain; charset=utf-8');

// UTF-8 sample text.
$t = 'Здравей!';
// We convert it to UTF-16LE, to then convert it back to UTF-8
$t = iconv('UTF-8', 'UTF-16LE', $t);

// We set up ob_iconv_handler() to do UTF-16LE -> UTF-8 conversion
ini_set('internal_encoding', 'UTF-16LE');
ini_set('output_encoding', 'UTF-8');

// Prints "Здравей!" as expected.
ob_start('ob_iconv_handler', 4096);
for ($i = 0; $i<strlen($t); $i++) {
   echo $t[$i];
}
ob_end_flush();


--------------------------------------------------------------------------------
EXAMPLE 2:
--------------------------------------------------------------------------------

// Make sure display is right for browsers (also works in CLI if UTF8 is supported).
header('Content-Type: text/plain; charset=utf-8');

// UTF-8 sample text.
$t = 'Здравей!';
// We convert it to UTF-16LE, to then convert it back to UTF-8
$t = iconv('UTF-8', 'UTF-16LE', $t);

// We set up ob_iconv_handler() to do UTF-16LE -> UTF-8 conversion
ini_set('internal_encoding', 'UTF-16LE');
ini_set('output_encoding', 'UTF-8');

// Prints GARBLED OUTPUT.
ob_start('ob_iconv_handler', 1);
for ($i = 0; $i<strlen($t); $i++) {
   echo $t[$i];
}
ob_end_flush();

Expected result:
----------------
Same correct output in both samples.

Actual result:
--------------
Garbled output in the second example.


[2017-08-28 14:49 UTC] tyzoid dot d at gmail dot com

Potential workaround using mb_strcut

---------
$mb_str="        0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F
U+1F60x ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
U+1F61x ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
U+1F62x ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
U+1F63x ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
U+1F64x ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
U+1F68x ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
U+1F69x ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
U+1F6Ax ? ? ? ? ?\n";

file_put_contents("emoji-16.txt", mb_convert_encoding($mb_str, 'UTF-16LE'));

$fh = fopen("emoji-16.txt", "r");

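$input_format = 'UTF-16LE';
$databuf = "";
while (!feof($fh))
{
	// fread() returns false on failure; otherwise append the byte to the pending buffer
	$chunk = fread($fh, 1);
	if ($chunk === false)
	{
		break;
	}
	$databuf .= $chunk;

	// Get the longest prefix of the data buffer that ends on a character boundary
	$str = mb_strcut($databuf, 0, null, $input_format);
	// Keep any trailing partial character for the next iteration
	$databuf = (string) substr($databuf, strlen($str));
	echo mb_convert_encoding($str, "UTF-8", $input_format);
}
fclose($fh);

// Convert whatever is still complete in the buffer at end of input
if ($databuf !== "")
{
	$str = mb_strcut($databuf, 0, null, $input_format);
	echo mb_convert_encoding($str, "UTF-8", $input_format);
}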
----------

[2018-08-26 16:51 UTC] cmb@php.net

-Status: Open +Status: Verified

[2018-08-26 16:51 UTC] cmb@php.net

Confirmed.  The problem is that php_iconv_output_handler()[1]
ignores and forgets partially passed multibyte characters.

> The "ob_iconv_handler" is the only means in PHP to convert
> streaming input from one charset to another charset.

Not quite.  There are also the convert.iconv.* stream filters[2],
and mb_output_handler()[3].

[1] <https://github.com/php/php-src/blob/php-7.3.0beta2/ext/iconv/iconv.c#L394>
[2] <http://php.net/manual/en/filters.convert.php>
[3] <http://php.net/manual/en/function.mb-output-handler.php>
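
For reference, a minimal sketch of the stream filter route from [2], using the sample text from the report. php://memory is only a stand-in for a real destination stream, and the one-byte writes are chosen to force every character across a write boundary. Whether the output round-trips depends on the filter carrying partial characters between writes, which is the behaviour being assumed here.

```php
<?php
$utf8  = 'Здравей!';
$utf16 = iconv('UTF-8', 'UTF-16LE', $utf8);

// php://memory stands in for any writable destination stream.
$out = fopen('php://memory', 'w+');
$filter = stream_filter_append($out, 'convert.iconv.UTF-16LE/UTF-8', STREAM_FILTER_WRITE);

// Write one byte at a time, so every character is split across writes.
for ($i = 0, $n = strlen($utf16); $i < $n; $i++) {
    fwrite($out, $utf16[$i]);
}

// Removing the filter flushes anything it still holds.
stream_filter_remove($filter);

rewind($out);
$result = stream_get_contents($out);
var_dump($result === $utf8);
```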


