Request #80689 Add support for incremental encoding conversion
Submitted: 2021-01-30 17:50 UTC Modified: 2021-01-30 18:26 UTC
From: dhammond at webdevout dot net Assigned:
Status: Open Package: mbstring related
PHP Version: 8.0.1 OS:
Private report: No CVE-ID: None
 [2021-01-30 17:50 UTC] dhammond at webdevout dot net
Description:
------------
Mbstring currently supports converting a complete string of text from one encoding to another, but it doesn't yet support converting a stream of text incrementally. This is needed in streaming workflows that process one chunk of bytes at a time.

If you try to use mb_convert_encoding() in a streaming workflow, you run into problems with multibyte encodings:

1. The chunk might end in the middle of a multibyte sequence, resulting in corruption at the chunk boundaries.
2. Byte order detection in encodings like UTF-16 is reset for each chunk, so the converter might correctly interpret the first chunk as UTF-16LE and then incorrectly interpret the next chunk as UTF-16BE.
3. Some special encodings, like BASE64, have unique problems at chunk boundaries. In the case of BASE64 output encoding, if the input chunk is not a multiple of 3 bytes, then the chunk output will contain padding characters which should not exist in the middle of base64 data.
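
Problems 1 and 3 can be reproduced with a few lines. This is only a minimal sketch: the sample strings and chunk sizes are arbitrary, and base64_encode() is used as a stand-in for mbstring's BASE64 output encoding because it shows the same padding behaviour.

```php
<?php
// Problem 1: "é" is two bytes in UTF-8 (0xC3 0xA9); split it across two chunks.
$text = "caf\u{E9}";                                  // "café", 5 bytes in UTF-8
$chunks = [substr($text, 0, 4), substr($text, 4)];    // the split lands inside "é"

$whole = mb_convert_encoding($text, 'UTF-16LE', 'UTF-8');
$piecewise = '';
foreach ($chunks as $chunk) {
    // Each call sees only half of the "é" sequence and treats it as invalid input.
    $piecewise .= mb_convert_encoding($chunk, 'UTF-16LE', 'UTF-8');
}
var_dump($whole === $piecewise);                      // the two results differ

// Problem 3: chunks whose length is not a multiple of 3 produce padding mid-stream.
$b64Whole = base64_encode('abcdef');                          // "YWJjZGVm"
$b64Piecewise = base64_encode('abcd') . base64_encode('ef');  // "YWJjZA==ZWY="
var_dump($b64Whole, $b64Piecewise);
```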

These problems would be resolved if we had a way to convert encodings incrementally. The mbstring module appears to support incremental conversion under the hood, but it doesn't yet expose any incremental API to userland. Here's an example of how such an API might look:

$context = mb_convert_init('UTF-8', 'UTF-16'); // To convert from UTF-16 to UTF-8.

while (!$source->feof())
{
  $input_chunk = $source->read(8192);
  $output_chunk = mb_convert_add($context, $input_chunk, false);
  $dest->write($output_chunk);
}

$output_chunk = mb_convert_add($context, '', true);
$dest->write($output_chunk);

In the above example, the third argument of mb_convert_add() is set to true for the final chunk, to indicate that it should finalize the stream and flush any buffers. In code structured more like a stream filter, the call would more commonly look like "$output_chunk = mb_convert_add($this->context, $input_chunk, $closing);", where the final call may itself carry input data that should be converted before the stream is finalized.
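
For comparison, the same calling convention (chunk plus a closing flag) can be approximated in userland today for one narrow case. The sketch below is an assumption-laden illustration, not the proposed API: the class name IncrementalUtf8Converter and the helper incompleteTailLength() are hypothetical, it only handles UTF-8 input, and it addresses problem 1 only (not byte order detection or BASE64 padding). It holds back an incomplete trailing UTF-8 sequence until the next chunk arrives.

```php
<?php
// Hypothetical helper: number of trailing bytes that start a UTF-8 sequence
// which is not yet complete (0 if the data ends on a character boundary).
function incompleteTailLength(string $bytes): int
{
    $len = strlen($bytes);
    for ($i = 1; $i <= min(3, $len); $i++) {
        $byte = ord($bytes[$len - $i]);
        if (($byte & 0xC0) === 0x80) {
            continue;                       // continuation byte, keep looking back
        }
        if ($byte >= 0xF0) {
            $need = 4;
        } elseif ($byte >= 0xE0) {
            $need = 3;
        } elseif ($byte >= 0xC0) {
            $need = 2;
        } else {
            $need = 1;                      // ASCII
        }
        return $need > $i ? $i : 0;
    }
    return 0;
}

// Hypothetical userland stand-in for the proposed context object.
final class IncrementalUtf8Converter
{
    private string $carry = '';

    public function __construct(private string $toEncoding)
    {
    }

    public function add(string $chunk, bool $closing): string
    {
        $data = $this->carry . $chunk;
        // On the final call nothing is held back; leftovers are converted as-is.
        $hold = $closing ? 0 : incompleteTailLength($data);
        $this->carry = $hold > 0 ? substr($data, -$hold) : '';
        $ready = $hold > 0 ? substr($data, 0, -$hold) : $data;
        return $ready === '' ? '' : mb_convert_encoding($ready, $this->toEncoding, 'UTF-8');
    }
}

// Usage: feed the input one byte at a time, then finalize.
$input = "caf\u{E9} \u{2603}";
$converter = new IncrementalUtf8Converter('UTF-16LE');
$output = '';
foreach (str_split($input, 1) as $byte) {
    $output .= $converter->add($byte, false);
}
$output .= $converter->add('', true);
```

A target encoding without a byte order mark (UTF-16LE here) is chosen so that the concatenated chunk outputs can be compared directly with a one-shot conversion.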


 [2021-01-30 18:26 UTC] cmb@php.net
Note that iconv conversion stream filters[1] are available.

[1] <https://www.php.net/manual/en/filters.convert.php#filters.convert.iconv>
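
As a rough sketch of what using those filters might look like, the following copies UTF-16LE data to a destination stream in deliberately odd-sized chunks, with a convert.iconv.* filter attached on the write side. It assumes the iconv extension is available; the in-memory streams and the 3-byte chunk size are arbitrary choices for illustration.

```php
<?php
// Sample UTF-16LE input held in memory (stand-in for a real source stream).
$source = fopen('php://memory', 'w+b');
fwrite($source, mb_convert_encoding("caf\u{E9} \u{2603}", 'UTF-16LE', 'UTF-8'));
rewind($source);

// Everything written to $dest passes through the iconv filter first.
$dest = fopen('php://memory', 'w+b');
$filter = stream_filter_append($dest, 'convert.iconv.UTF-16LE/UTF-8', STREAM_FILTER_WRITE);

while (!feof($source)) {
    $chunk = fread($source, 3);             // 3 bytes splits 2-byte code units
    if ($chunk !== false && $chunk !== '') {
        fwrite($dest, $chunk);
    }
}

// Removing the filter flushes anything it still has buffered.
stream_filter_remove($filter);
rewind($dest);
$converted = stream_get_contents($dest);
fclose($source);
fclose($dest);
```

Note that the byte order is named explicitly (UTF-16LE) rather than left to detection.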
 