Bug #81676	Converting string to UTF using \mb_convert_encoding returns empty string
Submitted:	2021-11-30 12:10 UTC	Modified:	2021-12-05 17:35 UTC
From:	anton at gamee dot com	Assigned:	alexdowad (profile)
Status:	Duplicate	Package:	mbstring related
PHP Version:	8.1.0	OS:	MacOs
Private report:	No	CVE-ID:	None

[2021-11-30 12:10 UTC] anton at gamee dot com

Description:
------------
When \mb_convert_encoding() is called with \mb_list_encodings() as its third parameter, it returns an empty string.

Test script:
---------------
<?php

declare(strict_types=1);
$string = 'test';
$string = \mb_convert_encoding($string, 'UTF-8', \mb_list_encodings());
echo $string; // '' expecting 'test'

Expected result:
----------------
test

Actual result:
--------------
''

History

[2021-11-30 12:28 UTC] cmb@php.net

-Status: Open
+Status: Duplicate
-Package: Unknown/Other Function
+Package: mbstring related
-Assigned To:
+Assigned To: cmb

[2021-11-30 12:28 UTC] cmb@php.net

Well, encoding detection is generally hard, if not impossible, and
there have been changes lately.  Basically, your problem boils
down to <https://3v4l.org/K2qsM>.

Anyhow, this ticket is a duplicate of bug #81390.

[2021-12-05 16:09 UTC] alexdowad@php.net

-Assigned To: cmb
+Assigned To: alexdowad

[2021-12-05 16:09 UTC] alexdowad@php.net

Anton, thanks so much for discovering this problem.

PHP is mistakenly identifying the string 'test' as UCS-4 text. In UCS-4, it would be a single (bogus) codepoint U+74657374.
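
To make the arithmetic concrete, here is a minimal sketch of how the four ASCII bytes of 'test' read as a single big-endian 32-bit value. The helper name ucs4Label() is hypothetical and exists only for this illustration; it is not part of mbstring.

```php
<?php
declare(strict_types=1);

/**
 * Interpret a 4-byte string as one big-endian 32-bit value, the way a
 * big-endian UCS-4 decoder would read it, and format it as U+XXXXXXXX.
 * ucs4Label() is a hypothetical helper, not an mbstring function.
 */
function ucs4Label(string $bytes): string
{
    if (strlen($bytes) !== 4) {
        throw new InvalidArgumentException('expected exactly 4 bytes');
    }
    // 'N' = unsigned 32-bit integer in big-endian byte order.
    $value = unpack('N', $bytes)[1];
    return sprintf('U+%08X', $value);
}

// 't' = 0x74, 'e' = 0x65, 's' = 0x73, 't' = 0x74
echo ucs4Label('test'), "\n"; // U+74657374
```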

Part of the reason for this is that the new implementation of mb_detect_encoding (which is also used internally by mb_convert_encoding when encoding detection is needed) relies on the text encoding conversion code to signal errors if the input text is not valid in a certain encoding.
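
The idea of detection by "does the input validate in this encoding?" can be sketched as follows. This is only an illustration that assumes the mbstring extension is loaded; firstValidEncoding() is a hypothetical helper, and the real detector in php-src is more involved than a first-match loop.

```php
<?php
declare(strict_types=1);

/**
 * Return the first candidate encoding in which $input is valid, or null.
 * Hypothetical helper: it only illustrates that a candidate which never
 * reports errors can never be ruled out by this kind of check.
 */
function firstValidEncoding(string $input, array $candidates): ?string
{
    foreach ($candidates as $encoding) {
        // mb_check_encoding() reports whether the bytes are valid in $encoding.
        if (mb_check_encoding($input, $encoding)) {
            return $encoding;
        }
    }
    return null;
}

// "\xFF" is not valid UTF-8, but every byte is valid in ISO-8859-1.
var_dump(firstValidEncoding("\xFF", ['UTF-8', 'ISO-8859-1']));
```

An encoding whose conversion code accepts any byte sequence always passes such a check, which is why permissive candidates are a problem for detection.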

By design, the conversion code for UCS-4 doesn't treat codepoints over U+10FFFF as erroneous. That is the difference between UTF-32 and UCS-4; UTF-32 requires that all input characters should be within the legal range for Unicode codepoints, but UCS-4 allows anything.
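
As a rough sketch of that difference, the range check that a UTF-32 validator applies and a UCS-4 decoder skips might look like this. The function name is hypothetical, and the surrogate exclusion is added here from the Unicode definition of UTF-32 rather than from the comment above.

```php
<?php
declare(strict_types=1);

const MAX_UNICODE_CODEPOINT = 0x10FFFF;

/**
 * Whether a 32-bit value is acceptable as a UTF-32 code unit.
 * Hypothetical helper for illustration; it is not how mbstring is written.
 */
function isValidUtf32Value(int $value): bool
{
    if ($value < 0 || $value > MAX_UNICODE_CODEPOINT) {
        return false; // outside the Unicode codepoint range
    }
    // Surrogates (U+D800..U+DFFF) are reserved for UTF-16 and are not valid in UTF-32.
    return $value < 0xD800 || $value > 0xDFFF;
}

var_dump(isValidUtf32Value(0x74));       // 't' as a single codepoint
var_dump(isValidUtf32Value(0x74657374)); // the bytes of 'test' read as one value
```

Under this check the value 0x74657374 is rejected, whereas a decoder that allows anything would accept it.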

I see now that this characteristic of UCS-4 interacts badly with the new encoding detection code. Will have to work on that.

[2021-12-05 17:22 UTC] alexdowad@php.net

OK, my analysis was a bit wrong. I checked again and found that the string is being detected as 'UUENCODE', not UCS-4. Duh.

This is the same issue which I addressed for mb_detect_encoding in a2bc57e0e5.
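
Until a fixed release is available, one possible workaround on the caller's side is to keep transfer and markup formats out of the candidate list. The sketch below is an assumption, not a description of the fix: the list of names in NON_TEXT_ENCODINGS and the helper textEncodingsOnly() are made up for illustration.

```php
<?php
declare(strict_types=1);

// Assumption: these mbstring entries are transfer or markup formats, not text encodings.
const NON_TEXT_ENCODINGS = ['BASE64', 'UUENCODE', 'HTML-ENTITIES', 'Quoted-Printable', '7bit', '8bit'];

/**
 * Drop non-text entries from a list of encoding names (case-insensitive).
 * Hypothetical helper, not part of mbstring.
 */
function textEncodingsOnly(array $encodings): array
{
    $blocked = array_map('strtolower', NON_TEXT_ENCODINGS);
    return array_values(array_filter(
        $encodings,
        static fn (string $name): bool => !in_array(strtolower($name), $blocked, true)
    ));
}

// Usage, only when mbstring is available.
if (function_exists('mb_list_encodings')) {
    $candidates = textEncodingsOnly(mb_list_encodings());
    echo count($candidates), " candidate encodings remain\n";
}
```

Even with such a filter, detection across every supported encoding remains a guess; passing a short explicit list of the encodings the input can actually have, such as ['UTF-8', 'ISO-8859-1'], is likely to be more predictable.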

[2021-12-05 17:35 UTC] alexdowad@php.net

Please see the fix here:

https://github.com/php/php-src/pull/7659/commits/190710a8373b347519e0987cadb3777f6f0a0998
