Bug #81676 Converting string to UTF using \mb_convert_encoding returns empty string
Submitted: 2021-11-30 12:10 UTC Modified: 2021-12-05 17:35 UTC
From: anton at gamee dot com Assigned: alexdowad (profile)
Status: Duplicate Package: mbstring related
PHP Version: 8.1.0 OS: MacOs
Private report: No CVE-ID: None
 [2021-11-30 12:10 UTC] anton at gamee dot com
Description:
------------
When \mb_convert_encoding() is called with the result of \mb_list_encodings() as its third parameter, it returns an empty string.

Test script:
---------------
<?php

declare(strict_types=1);
$string = 'test';
$string = \mb_convert_encoding($string, 'UTF-8', \mb_list_encodings());
echo $string; // '' expecting 'test'

Expected result:
----------------
test

Actual result:
--------------
''

 [2021-11-30 12:28 UTC] cmb@php.net
-Status: Open
+Status: Duplicate
-Package: Unknown/Other Function
+Package: mbstring related
-Assigned To:
+Assigned To: cmb
 [2021-11-30 12:28 UTC] cmb@php.net
Well, encoding detection is generally hard, if not impossible, and
there have been changes lately.  Basically, your problem boils
down to <https://3v4l.org/K2qsM>.

Anyhow, this ticket is a duplicate of bug #81390.
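
For readers hitting the same symptom, a minimal sketch of one way to sidestep broad detection: pass a short, explicit candidate list instead of every encoding mbstring knows. The two candidates below are an assumption for illustration; the right list depends on where the input comes from.

```php
<?php

declare(strict_types=1);

$string = 'test';

// Assumption: the caller knows the input can only be one of these encodings.
$candidates = ['UTF-8', 'ISO-8859-1'];

// Detection now only has to choose between two plausible text encodings.
$converted = mb_convert_encoding($string, 'UTF-8', $candidates);

echo $converted, PHP_EOL;
```

When the source encoding is known outright, passing it as a single string avoids detection altogether.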
 [2021-12-05 16:09 UTC] alexdowad@php.net
-Assigned To: cmb
+Assigned To: alexdowad
 [2021-12-05 16:09 UTC] alexdowad@php.net
Anton, thanks so much for discovering this problem.

PHP is mistakenly identifying the string 'test' as UCS-4 text. In UCS-4, it would be a single (bogus) codepoint U+74657374.
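
The codepoint value can be checked by reading the four bytes of 'test' as one big-endian 32-bit integer. A small sketch (the variable names are only illustrative):

```php
<?php

declare(strict_types=1);

// 'N' unpacks an unsigned 32-bit big-endian integer; unpack() indexes from 1.
$codepoint = unpack('N', 'test')[1];

// 't' = 0x74, 'e' = 0x65, 's' = 0x73, 't' = 0x74
$label = sprintf('U+%08X', $codepoint);

echo $label, PHP_EOL; // U+74657374
```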

Part of the reason is that the new implementation of mb_detect_encoding (which mb_convert_encoding also uses internally when encoding detection is needed) relies on the text encoding conversion code to signal errors if the input text is not valid in a given encoding.

By design, the conversion code for UCS-4 doesn't treat codepoints over U+10FFFF as erroneous. That is the difference between UTF-32 and UCS-4: UTF-32 requires all input characters to be within the legal range for Unicode codepoints, but UCS-4 allows anything.
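
To illustrate the range rule, here is a rough sketch of the kind of check a UTF-32 validator applies and a UCS-4 one, as described above, does not. The helper isUnicodeScalarValue() is hypothetical and not part of mbstring; rejecting surrogates is an addition based on the Unicode definition of a scalar value, not something stated in this report.

```php
<?php

declare(strict_types=1);

// Hypothetical helper, not an mbstring function.
function isUnicodeScalarValue(int $cp): bool
{
    // Outside the Unicode codespace: never valid in UTF-32.
    if ($cp < 0 || $cp > 0x10FFFF) {
        return false;
    }

    // Assumption: surrogate codepoints (U+D800..U+DFFF) are also rejected.
    return $cp < 0xD800 || $cp > 0xDFFF;
}

// The four bytes of 'test' read as one big-endian 32-bit value.
$cp = unpack('N', 'test')[1];

var_dump(isUnicodeScalarValue($cp));     // bool(false)
var_dump(isUnicodeScalarValue(0x1F600)); // bool(true)
```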

I see now that this characteristic of UCS-4 interacts badly with the new encoding detection code. Will have to work on that.
 [2021-12-05 17:22 UTC] alexdowad@php.net
OK, my analysis was a bit wrong. I checked again and found that the string is being detected as 'UUENCODE', not UCS-4. Duh.

This is the same issue which I addressed for mb_detect_encoding in a2bc57e0e5.
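
Until a fixed build is available, one possible user-side workaround is to drop the transfer-style encodings from the candidate list before passing it on. This is only a sketch: the exclusion list is an assumption, it is not necessarily what a2bc57e0e5 does, and detection across the remaining encodings can still guess wrong.

```php
<?php

declare(strict_types=1);

// Assumption: these entries are transfer or escaping schemes rather than
// text encodings, so they are poor candidates for detection.
$nonText = ['BASE64', 'UUENCODE', 'HTML-ENTITIES', 'QUOTED-PRINTABLE', '7BIT', '8BIT'];

// Compare case-insensitively, since mbstring reports mixed-case names.
$candidates = array_values(array_filter(
    mb_list_encodings(),
    static fn (string $name): bool => !in_array(strtoupper($name), $nonText, true)
));

$converted = mb_convert_encoding('test', 'UTF-8', $candidates);
var_dump($converted);
```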
 