Bug #81676 Converting string to UTF using \mb_convert_encoding returns empty string
Submitted: 2021-11-30 12:10 UTC Modified: 2021-12-05 17:35 UTC
From: anton at gamee dot com Assigned: alexdowad (profile)
Status: Duplicate Package: mbstring related
PHP Version: 8.1.0 OS: MacOs
Private report: No CVE-ID: None
 [2021-11-30 12:10 UTC] anton at gamee dot com
Description:
------------
When \mb_convert_encoding() is called with the result of \mb_list_encodings() as its third parameter, it returns an empty string.

Test script:
---------------
<?php

declare(strict_types=1);
$string = 'test';
$string = \mb_convert_encoding($string, 'UTF-8', \mb_list_encodings());
echo $string; // '' expecting 'test'

Expected result:
----------------
test

Actual result:
--------------
''

 [2021-11-30 12:28 UTC] cmb@php.net
-Status: Open
+Status: Duplicate
-Package: Unknown/Other Function
+Package: mbstring related
-Assigned To:
+Assigned To: cmb
 [2021-11-30 12:28 UTC] cmb@php.net
Well, encoding detection is generally hard, if not impossible, and
there have been changes lately.  Basically, your problem boils
down to <https://3v4l.org/K2qsM>.

Anyhow, this ticket is a duplicate of bug #81390.
 [2021-12-05 16:09 UTC] alexdowad@php.net
-Assigned To: cmb
+Assigned To: alexdowad
 [2021-12-05 16:09 UTC] alexdowad@php.net
Anton, thanks so much for discovering this problem.

PHP is mistakenly identifying the string 'test' as UCS-4 text. In UCS-4, it would be a single (bogus) codepoint U+74657374.

Part of the reason is that the new implementation of mb_detect_encoding (which mb_convert_encoding also uses internally when encoding detection is needed) relies on the text encoding conversion code to signal errors if the input text is not valid in a given encoding.

By design, the conversion code for UCS-4 doesn't treat codepoints over U+10FFFF as erroneous. That is the difference between UTF-32 and UCS-4: UTF-32 requires all input characters to be within the legal range for Unicode codepoints, but UCS-4 allows anything.
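
For illustration only, the arithmetic behind the U+74657374 figure can be reproduced by reading the four bytes of 'test' as one big-endian 32-bit unit. This is a sketch of the byte interpretation, not of mbstring's internal conversion code; the variable names are made up for the example.

```php
<?php

// Read the four bytes of 'test' as a single big-endian 32-bit unsigned integer,
// which is how a 4-byte-per-character encoding would see them.
$bytes = 'test';
$value = unpack('N', $bytes)[1];

// Highest codepoint that Unicode (and therefore UTF-32) allows.
$unicodeMax = 0x10FFFF;

printf("bytes: %s\n", bin2hex($bytes));
printf("as one 32-bit unit: U+%X\n", $value);
printf("within Unicode range: %s\n", $value <= $unicodeMax ? 'yes' : 'no');
```

The value lies far above U+10FFFF, so a range check of the kind described for UTF-32 would reject it, while a decoder that skips that check would accept it.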

I see now that this characteristic of UCS-4 interacts badly with the new encoding detection code. Will have to work on that.
 [2021-12-05 17:22 UTC] alexdowad@php.net
OK, my analysis was a bit wrong. I checked again and found that the string is being detected as 'UUENCODE', not UCS-4. Duh.

This is the same issue which I addressed for mb_detect_encoding in a2bc57e0e5.
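
A possible workaround on affected versions, sketched here under assumptions: rather than passing every encoding from mb_list_encodings(), detect against a short list of encodings the input is actually expected to use, then convert from the detected one. The helper name toUtf8 and the default candidate list are hypothetical and should be adapted to the real input sources; the mbstring extension must be loaded.

```php
<?php

// Hypothetical helper: convert a string to UTF-8 using a short, explicit
// candidate list instead of every encoding mbstring knows about.
function toUtf8(string $input, array $candidates = ['ASCII', 'UTF-8', 'ISO-8859-1']): string
{
    // Strict mode (third argument) only accepts encodings in which the
    // whole input is valid.
    $detected = mb_detect_encoding($input, $candidates, true);

    if ($detected === false) {
        throw new InvalidArgumentException('Input matches none of the candidate encodings.');
    }

    return mb_convert_encoding($input, 'UTF-8', $detected);
}

echo toUtf8('test'), "\n";
```

Keeping the list short avoids formats such as UUENCODE ever being considered as a source encoding, at the cost of having to know in advance which encodings can occur.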
 