Bug #75131 Iconv output buffering handler doesn't encode correctly across chunk boundaries
Submitted: 2017-08-28 13:52 UTC Modified: 2018-08-26 16:51 UTC
Votes:3
Avg. Score:4.7 ± 0.5
Reproduced:3 of 3 (100.0%)
Same Version:1 (33.3%)
Same OS:2 (66.7%)
From: jocrutrisi at ibsats dot com Assigned:
Status: Verified Package: ICONV related
PHP Version: 7.2.0beta3 OS: Linux
Private report: No CVE-ID: None
 [2017-08-28 13:52 UTC] jocrutrisi at ibsats dot com
Description:
------------
The "ob_iconv_handler" is the only means in PHP to convert streaming input from one charset to another charset.

We can't use any of the other functions: when we process input in chunks of fixed or arbitrary size, variable-width encodings like UTF-8 often leave partial characters at the start or end of a chunk. If we convert every chunk as if it were a complete string, we end up with corrupted output, because the conversion never sees those characters whole.
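
To make the failure mode concrete, here is a minimal sketch that converts a UTF-8 string in fixed-size chunks with plain iconv(). The 3-byte chunk size is an arbitrary assumption chosen so that some boundaries fall inside a 2-byte Cyrillic character; the variable names are illustrative only.

```php
<?php
// 'Здравей!' is 7 Cyrillic letters (2 bytes each in UTF-8) plus '!' (1 byte).
$text = 'Здравей!';

// Assumed chunk size of 3 bytes, so some chunk boundaries split a character.
$chunks = str_split($text, 3);

$converted = '';
$failures = 0;
foreach ($chunks as $chunk) {
    // iconv() returns false (and raises a notice, silenced here) when the
    // chunk starts or ends in the middle of a multibyte character.
    $part = @iconv('UTF-8', 'UTF-16LE', $chunk);
    if ($part === false) {
        $failures++;
    } else {
        $converted .= $part;
    }
}

// Converting the whole string in one call, for comparison.
$whole = iconv('UTF-8', 'UTF-16LE', $text);

printf(
    "%d of %d chunks failed; chunked output equals whole-string output: %s\n",
    $failures,
    count($chunks),
    var_export($converted === $whole, true)
);
```

Each chunk is handed to iconv() as if it were a complete string, which is exactly the situation described above.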

Since "ob_iconv_handler" is advertised as a stream output handler, you'd expect it to handle this scenario correctly. Alas, it doesn't.

Output is correctly encoded when we give "complete" chunks to the handler. But if we cut down the buffer size so partial characters are sent in each chunk... we get garbled output.

This is especially troubling not only because it's not correct, but because this is supposedly the ONLY WAY to convert a stream from one charset to another. The only other option in PHP right now is to load the entire string into memory and convert it in one go. If it doesn't fit in memory, we're S.O.L.

There are many other issues with ob_iconv_handler: it relies on global settings, it has global state, and it is not reentrant... It'd be amazing if we had some sort of iconv_open / iconv_read / iconv_write / iconv_close API to handle these cases, but I digress...
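
The shape of such an API can be approximated in userland. What follows is only a rough sketch under stated assumptions: the input is UTF-8, the target is a stateless encoding without a BOM (such as UTF-16LE), and the class name ChunkedUtf8Converter with its write() and close() methods is hypothetical, not something ext/iconv provides. It holds back a trailing incomplete character until the rest of it arrives.

```php
<?php
// Hypothetical helper, not part of ext/iconv.
final class ChunkedUtf8Converter
{
    private $to;
    private $pending = ''; // bytes of a character that is not complete yet

    public function __construct($to)
    {
        $this->to = $to;
    }

    // Converts as much of the buffered input as forms complete characters.
    public function write($chunk)
    {
        $buf = $this->pending . $chunk;
        $cut = self::completeLength($buf);
        $this->pending = (string) substr($buf, $cut);
        if ($cut === 0) {
            return '';
        }
        $out = iconv('UTF-8', $this->to, substr($buf, 0, $cut));
        if ($out === false) {
            throw new RuntimeException('Invalid UTF-8 input');
        }
        return $out;
    }

    // Call once at end of input; leftover bytes mean truncated input.
    public function close()
    {
        if ($this->pending !== '') {
            throw new RuntimeException('Input ended inside a character');
        }
    }

    // Length of the longest prefix that does not end in a partial character.
    private static function completeLength($buf)
    {
        $len = strlen($buf);
        for ($back = 1; $back <= 4 && $back <= $len; $back++) {
            $byte = ord($buf[$len - $back]);
            if (($byte & 0xC0) === 0x80) {
                continue; // continuation byte, keep looking for the lead byte
            }
            if ($byte >= 0xF0) {
                $need = 4;
            } elseif ($byte >= 0xE0) {
                $need = 3;
            } elseif ($byte >= 0xC0) {
                $need = 2;
            } else {
                $need = 1;
            }
            return $need > $back ? $len - $back : $len;
        }
        return $len; // no lead byte near the end: let iconv() reject it
    }
}

// Usage: feed the sample text one byte at a time.
$text = 'Здравей!';
$conv = new ChunkedUtf8Converter('UTF-16LE');
$result = '';
foreach (str_split($text, 1) as $byte) {
    $result .= $conv->write($byte);
}
$conv->close();
var_dump($result === iconv('UTF-8', 'UTF-16LE', $text));
```

Because every write() is an independent iconv() call, this approach would not suit stateful target encodings or ones that emit a BOM per call; a real streaming API would keep one conversion descriptor open instead.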

Find the examples below reproducing the problem.

Test script:
---------------
--------------------------------------------------------------------------------
EXAMPLE 1:
--------------------------------------------------------------------------------

<?php
// Make sure display is right for browsers (also works in CLI if UTF-8 is supported).
header('Content-Type: text/plain; charset=utf-8');

// UTF-8 sample text.
$t = 'Здравей!';
// We convert it to UTF-16LE, to then convert it back to UTF-8
$t = iconv('UTF-8', 'UTF-16LE', $t);

// We set-up ob_iconv_handler() to do UTF-16LE -> UTF-8 conversion
ini_set('internal_encoding', 'UTF-16LE');
ini_set('output_encoding', 'UTF-8');

// Prints "Здравей!" as expected.
ob_start('ob_iconv_handler', 4096);
for ($i = 0; $i<strlen($t); $i++) {
   echo $t[$i];
}
ob_end_flush();


--------------------------------------------------------------------------------
EXAMPLE 2:
--------------------------------------------------------------------------------

<?php
// Make sure display is right for browsers (also works in CLI if UTF-8 is supported).
header('Content-Type: text/plain; charset=utf-8');

// UTF-8 sample text.
$t = 'Здравей!';
// We convert it to UTF-16LE, to then convert it back to UTF-8
$t = iconv('UTF-8', 'UTF-16LE', $t);

// We set-up ob_iconv_handler() to do UTF-16LE -> UTF-8 conversion
ini_set('internal_encoding', 'UTF-16LE');
ini_set('output_encoding', 'UTF-8');

// Prints GARBLED OUTPUT.
ob_start('ob_iconv_handler', 1);
for ($i = 0; $i<strlen($t); $i++) {
   echo $t[$i];
}
ob_end_flush();

Expected result:
----------------
Same correct output in both samples.

Actual result:
--------------
Garbled output in the second example.

 [2017-08-28 14:49 UTC] tyzoid dot d at gmail dot com
Potential workaround using mb_strcut

---------
<?php
$mb_str="        0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F
U+1F60x 😀 😁 😂 😃 😄 😅 😆 😇 😈 😉 😊 😋 😌 😍 😎 😏
U+1F61x 😐 😑 😒 😓 😔 😕 😖 😗 😘 😙 😚 😛 😜 😝 😞 😟
U+1F62x 😠 😡 😢 😣 😤 😥 😦 😧 😨 😩 😪 😫 😬 😭 😮 😯
U+1F63x 😰 😱 😲 😳 😴 😵 😶 😷 😸 😹 😺 😻 😼 😽 😾 😿
U+1F64x 🙀 🙁 🙂 🙃 🙄 🙅 🙆 🙇 🙈 🙉 🙊 🙋 🙌 🙍 🙎 🙏
U+1F68x 🚀 🚁 🚂 🚃 🚄 🚅 🚆 🚇 🚈 🚉 🚊 🚋 🚌 🚍 🚎 🚏
U+1F69x 🚐 🚑 🚒 🚓 🚔 🚕 🚖 🚗 🚘 🚙 🚚 🚛 🚜 🚝 🚞 🚟
U+1F6Ax 🚠 🚡 🚢 🚣 🚤\n";

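// Name the source encoding explicitly instead of relying on mbstring's internal encoding.
file_put_contents("emoji-16.txt", mb_convert_encoding($mb_str, 'UTF-16LE', 'UTF-8'));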

$fh = fopen("emoji-16.txt", "r");

$input_format = 'UTF-16LE';
$databuf = "";
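// Read one byte at a time until a read hits end of file.
while (!feof($fh))
{
	$databuf .= fread($fh, 1);
	// Longest prefix of the buffer that mb_strcut considers complete characters
	$str = mb_strcut($databuf, 0, null, $input_format);
	// Keep any trailing partial character for the next iteration
	$databuf = substr($databuf, strlen($str));
	echo mb_convert_encoding($str, "UTF-8", $input_format);
}

// Convert whatever complete characters are still buffered after the loop.
if ($databuf)
{
	$str = mb_strcut($databuf, 0, null, $input_format);
	echo mb_convert_encoding($str, "UTF-8", $input_format);
}

fclose($fh);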
----------
 [2018-08-26 16:51 UTC] cmb@php.net
-Status: Open +Status: Verified
 [2018-08-26 16:51 UTC] cmb@php.net
Confirmed.  The problem is that php_iconv_output_handler()[1]
ignores and forgets partially passed multibyte characters.

> The "ob_iconv_handler" is the only means in PHP to convert
> streaming input from one charset to another charset.

Not quite.  There are also the convert.iconv.* stream filters[2],
and mb_output_handler()[3].

[1] <https://github.com/php/php-src/blob/php-7.3.0beta2/ext/iconv/iconv.c#L394>
[2] <http://php.net/manual/en/filters.convert.php>
[3] <http://php.net/manual/en/function.mb-output-handler.php>
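
For illustration, the stream filter route [2] might look roughly like the sketch below. It mirrors the reporter's test (UTF-16LE to UTF-8, one byte per write) but sends the data through a convert.iconv.* write filter on a php://memory stream instead of an output buffer. The choice of php://memory and the variable names are assumptions made for the example, and it relies on the filter carrying an incomplete character over to the next write.

```php
<?php
// Same sample as in the report: UTF-8 text converted to UTF-16LE first.
$utf8 = 'Здравей!';
$utf16 = iconv('UTF-8', 'UTF-16LE', $utf8);

// In-memory stream used as the conversion sink (an assumption for this
// sketch; any writable stream should do).
$out = fopen('php://memory', 'w+');

// Everything written to $out passes through iconv (UTF-16LE -> UTF-8).
$filter = stream_filter_append(
    $out,
    'convert.iconv.UTF-16LE/UTF-8',
    STREAM_FILTER_WRITE
);

// Write one byte at a time, so every 2-byte code unit is split across writes.
for ($i = 0; $i < strlen($utf16); $i++) {
    fwrite($out, $utf16[$i]);
}

// Removing the filter flushes anything it still holds.
stream_filter_remove($filter);

rewind($out);
$result = stream_get_contents($out);
fclose($out);

var_dump($result === $utf8);
```

Unlike the output handler, the filter is attached per stream, so it does not depend on the global internal_encoding and output_encoding settings.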
 