Request #80689 Add support for incremental encoding conversion
Submitted: 2021-01-30 17:50 UTC Modified: 2021-01-30 18:26 UTC
From: dhammond at webdevout dot net Assigned:
Status: Open Package: mbstring related
PHP Version: 8.0.1 OS:
Private report: No CVE-ID: None
 [2021-01-30 17:50 UTC] dhammond at webdevout dot net
Description:
------------
Mbstring currently supports converting a complete string of text from one encoding to another, but it doesn't yet support converting a stream of text incrementally. This is needed in streaming workflows that process one chunk of bytes at a time.

If you try to use mb_convert_encoding() in a streaming workflow, you run into problems with multibyte encodings:

1. The chunk might end in the middle of a multibyte sequence, resulting in corruption at the chunk boundaries.
2. Byte order detection in encodings like UTF-16 is reset for each chunk, so the first chunk might be correctly interpreted as UTF-16LE and the next chunk incorrectly interpreted as UTF-16BE.
3. Some special encodings, like BASE64, have unique problems at chunk boundaries. In the case of BASE64 output encoding, if the input chunk is not a multiple of 3 bytes, then the chunk output will contain padding characters which should not exist in the middle of base64 data.
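
The first and third problems can be reproduced with a short script. This is only an illustrative sketch: it assumes the mbstring extension is loaded, the sample strings and split points are arbitrary, and it uses the core base64_encode() function as a stand-in for mbstring's BASE64 encoding, since the padding behaviour at chunk boundaries is the same.

```php
<?php
// "h", then U+00E9 as the two UTF-8 bytes 0xC3 0xA9, then "llo".
$utf8 = "h\xC3\xA9llo";

// Converting the whole string at once gives the reference result.
$whole = mb_convert_encoding($utf8, 'UTF-16LE', 'UTF-8');

// Split between the two bytes of U+00E9 and convert each chunk separately.
$piecewise = '';
foreach ([substr($utf8, 0, 2), substr($utf8, 2)] as $chunk) {
    $piecewise .= mb_convert_encoding($chunk, 'UTF-16LE', 'UTF-8');
}

// The character that straddles the boundary cannot be recovered,
// so the concatenated chunks differ from the reference result.
$boundaryCorrupted = ($whole !== $piecewise);

// Base64: encoding 4-byte chunks separately pads every chunk.
$data = 'abcdefgh';
$wholeBase64 = base64_encode($data);
$piecewiseBase64 = base64_encode(substr($data, 0, 4))
                 . base64_encode(substr($data, 4));

var_dump($boundaryCorrupted);
echo $wholeBase64, "\n", $piecewiseBase64, "\n";
```

In the base64 case the per-chunk result carries "==" in the middle of the data, which a strict decoder would treat as the end of the input.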

These problems would be resolved if we had a way to convert encodings incrementally. The mbstring module appears to support incremental conversion under the hood, but it doesn't yet expose any incremental API to userland. Here's an example of how such an API might look:

$context = mb_convert_init('UTF-8', 'UTF-16'); // To convert from UTF-16 to UTF-8.

while (!$source->feof())
{
  $input_chunk = $source->read(8192);
  $output_chunk = mb_convert_add($context, $input_chunk, false);
  $dest->write($output_chunk);
}

$output_chunk = mb_convert_add($context, '', true);
$dest->write($output_chunk);

In the above example, the third argument of mb_convert_add() is set to true for the final chunk, to indicate that it should finalize the stream and flush any buffers. In usages that are structured more like stream filters, it may be more common for this to be called like "$output_chunk = mb_convert_add($this->context, $input_chunk, $closing);", where the final call may contain input data that should be added before finalizing the stream.


 [2021-01-30 18:26 UTC] cmb@php.net
Note that iconv conversion stream filters[1] are available.

[1] <https://www.php.net/manual/en/filters.convert.php#filters.convert.iconv>
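
For comparison, here is a minimal sketch of how the iconv stream filter mentioned above might be used for the UTF-16 to UTF-8 case. It assumes the iconv extension is available; the in-memory streams, the sample text, and the small read size are placeholders chosen only for illustration. The filter is attached to the source stream, so each fread() returns already-converted bytes and the conversion state is kept by the filter rather than by the calling code.

```php
<?php
// Sample text containing multibyte characters, as UTF-8 bytes.
$utf8 = "h\xC3\xA9llo w\xC3\xB6rld";

// Build a UTF-16LE source stream in memory (stand-in for a real input).
$source = fopen('php://memory', 'w+b');
fwrite($source, iconv('UTF-8', 'UTF-16LE', $utf8));
rewind($source);

// Convert from UTF-16LE to UTF-8 as data is read from the stream.
stream_filter_append($source, 'convert.iconv.UTF-16LE/UTF-8', STREAM_FILTER_READ);

// Stand-in for the destination; a real workflow would write to a file or socket.
$dest = fopen('php://memory', 'w+b');

// Read deliberately small pieces of converted output.
while (!feof($source)) {
    $outputChunk = fread($source, 3);
    if ($outputChunk !== false && $outputChunk !== '') {
        fwrite($dest, $outputChunk);
    }
}

rewind($dest);
$output = stream_get_contents($dest);
fclose($source);
fclose($dest);

var_dump($output === $utf8);
```

This covers encodings that iconv knows about; it does not address the mbstring-specific encodings or automatic byte order detection described in the request.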