PHP :: Bug #81676 :: Converting string to UTF using \mb_convert

Bug #81676	Converting string to UTF using \mb_convert_encoding returns empty string
Submitted:	2021-11-30 12:10 UTC	Modified:	2021-12-05 17:35 UTC
From:	anton at gamee dot com	Assigned:	alexdowad (profile)
Status:	Duplicate	Package:	mbstring related
PHP Version:	8.1.0	OS:	macOS
Private report:	No	CVE-ID:	None


[2021-11-30 12:10 UTC] anton at gamee dot com

Description:
------------
When \mb_convert_encoding() is called with the result of \mb_list_encodings() as its third parameter, it returns an empty string.

Test script:
---------------
<?php

declare(strict_types=1);
$string = 'test';
$string = \mb_convert_encoding($string, 'UTF-8', \mb_list_encodings());
echo $string; // '' expecting 'test'

Expected result:
----------------
test

Actual result:
--------------
''


[2021-11-30 12:28 UTC] cmb@php.net

-Status: Open
+Status: Duplicate
-Package: Unknown/Other Function
+Package: mbstring related
-Assigned To:
+Assigned To: cmb

[2021-11-30 12:28 UTC] cmb@php.net

Well, encoding detection is generally hard, if not impossible, and
there have been changes lately.  Basically, your problem boils
down to <https://3v4l.org/K2qsM>.

Anyhow, this ticket is a duplicate of bug #81390.

[2021-12-05 16:09 UTC] alexdowad@php.net

-Assigned To: cmb
+Assigned To: alexdowad

[2021-12-05 16:09 UTC] alexdowad@php.net

Anton, thanks so much for discovering this problem.

PHP is mistakenly identifying the string 'test' as UCS-4 text. In UCS-4, it would be a single (bogus) codepoint U+74657374.

Part of the reason is that the new implementation of mb_detect_encoding (which mb_convert_encoding also uses internally when encoding detection is needed) relies on the text encoding conversion code to signal errors if the input text is not valid in a given encoding.
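
To illustrate why validity alone cannot settle which encoding a string is in, here is a minimal sketch that asks mb_check_encoding whether the same bytes are acceptable in several encodings. It is a user-level stand-in, not the internal detection code, and the candidate list is an assumption chosen for illustration.

```php
<?php

declare(strict_types=1);

$input = 'test';

// Hypothetical candidate list, chosen only for this illustration.
$candidates = ['ASCII', 'UTF-8', 'ISO-8859-1', 'UTF-16BE', 'UTF-32BE'];

// Record, per encoding, whether the bytes of $input are reported as valid.
$valid = [];
foreach ($candidates as $encoding) {
    $valid[$encoding] = \mb_check_encoding($input, $encoding);
}

\var_dump($valid);
```

Whenever more than one entry is true, a detector that relies on conversion errors needs some further heuristic to choose between the encodings that accepted the input.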

By design, the conversion code for UCS-4 doesn't treat codepoints over U+10FFFF as erroneous. That is the difference between UTF-32 and UCS-4: UTF-32 requires every input codepoint to be within the legal range for Unicode codepoints, but UCS-4 allows anything.
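
The arithmetic behind the U+74657374 figure can be checked with a short sketch. It only reinterprets the four bytes of 'test' as one big-endian 32-bit integer using unpack(); it does not call the mbstring conversion code.

```php
<?php

declare(strict_types=1);

// Read the four bytes of 'test' as a single big-endian unsigned 32-bit
// integer, which is how a big-endian UCS-4 decoder would see them.
$codepoint = \unpack('N', 'test')[1];

\printf("U+%08X\n", $codepoint);      // U+74657374

// 0x10FFFF is the highest legal Unicode codepoint, so UTF-32 would
// reject this value, while UCS-4 as described above would not.
\var_dump($codepoint > 0x10FFFF);     // bool(true)
```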

I see now that this characteristic of UCS-4 interacts badly with the new encoding detection code. Will have to work on that.

[2021-12-05 17:22 UTC] alexdowad@php.net

OK, my analysis was a bit wrong. I checked again and found that the string is being detected as 'UUENCODE', not UCS-4. Duh.

This is the same issue which I addressed for mb_detect_encoding in a2bc57e0e5.
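
As a sketch of how a caller might sidestep this kind of misdetection, assuming the input is known to be in one of a few text encodings, the candidates can be passed explicitly instead of the full \mb_list_encodings() result. The candidate list below is an assumption for illustration and does not come from the report.

```php
<?php

declare(strict_types=1);

// Hypothetical candidate list: restrict it to the text encodings the
// input can realistically be in, so non-text encodings such as
// UUENCODE are never considered during detection.
$candidates = ['UTF-8', 'ISO-8859-1'];

$string = 'test';
$converted = \mb_convert_encoding($string, 'UTF-8', $candidates);

echo $converted, PHP_EOL; // test
```

Because 'test' is plain ASCII, either candidate converts it to the same UTF-8 bytes, so the result does not depend on which one detection picks.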

[2021-12-05 17:35 UTC] alexdowad@php.net

Please see the fix here:

https://github.com/php/php-src/pull/7659/commits/190710a8373b347519e0987cadb3777f6f0a0998



Copyright © 2001-2025 The PHP Group All rights reserved.	Last updated: Fri Jul 18 23:00:02 2025 UTC