Doc Bug #81362 iconv UTF-8//IGNORE fails to strip "\xF5\x80\x80\x80"
Submitted: 2021-08-15 15:34 UTC
Modified: 2021-08-16 12:44 UTC
From: divinity76 at gmail dot com
Assigned:
Status: Open
Package: ICONV related
PHP Version: 8.0.9
OS:
Private report: No
CVE-ID: None

 [2021-08-15 15:34 UTC] divinity76 at gmail dot com
Description:
------------
This reproduces on "PHP8.0.3 + iconv 2.28 + Ubuntu 20.04".
It also reproduces on 3v4l.org for every version from PHP5.6.0 to PHP8.0.9 inclusive, though the iconv version and OS there are unknown: https://3v4l.org/1b4G8

(the string also appears to trigger a bug in mb_check_encoding() that was fixed in 5.6.0?)

Interestingly, it does *not* reproduce on "PHP7.3.7-for-cygwin + iconv 1.0 + Windows 10"; there it correctly strips the string down to string(3) "PHP".

Is it possible that this is a bug introduced sometime after iconv 1.0 and no later than iconv 2.28?

Test script:
---------------
<?php
$invalid_utf8="\xF5\x80\x80\x80PHP";
$should_be_valid_utf8=iconv("UTF-8","UTF-8//IGNORE",$invalid_utf8);

var_dump($should_be_valid_utf8,bin2hex($should_be_valid_utf8),mb_check_encoding($should_be_valid_utf8));


Expected result:
----------------
string(3) "PHP"
string(6) "504850"
bool(true)


Actual result:
--------------
string(7) "����PHP"
string(14) "f5808080504850"
bool(false)


History

 [2021-08-15 15:42 UTC] cmb@php.net
-Type: Bug
+Type: Documentation Problem
 [2021-08-15 15:42 UTC] cmb@php.net
The latest GNU libiconv version is 1.16[1].  The iconv()
documentation[2] already warns about //TRANSLIT being
implementation specific; the same applies to //IGNORE.  I suggest
that you always build against GNU libiconv; system implementations
are known to be limited (or even buggy).

So no, not an implementation bug, but rather a doc issue.

[1] <https://www.gnu.org/software/libiconv/>
[2] <https://www.php.net/manual/en/function.iconv.php>
 [2021-08-15 16:02 UTC] divinity76 at gmail dot com
@cmb that's interesting, any idea where the deb.sury.org builds get their "iconv version 2.28" from?

But it's not that they simply ignore the //IGNORE directive: they understand the "//IGNORE" directive and yet don't strip this sequence. For example, they all correctly strip "\xf5\x80\x80PHP" down to "PHP", so the //IGNORE isn't being ignored.
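
Since the behaviour of //IGNORE depends on the iconv implementation, one possible workaround is to filter the string without iconv at all. The following is only a sketch, not anything proposed in this report: strip_invalid_utf8() is a hypothetical helper name, and the pattern is assumed to follow the well-formed byte ranges from the Unicode Standard (lead bytes C2..F4 only, no overlongs, no surrogates, nothing above U+10FFFF). It keeps every well-formed sequence and drops all other bytes.

```php
<?php
// Hypothetical helper (not part of PHP): keep only well-formed UTF-8
// sequences and silently drop every other byte.
function strip_invalid_utf8(string $s): string
{
    // Byte-oriented pattern (deliberately no "u" modifier).
    $wellFormed = '/
          [\x00-\x7F]                          # ASCII
        | [\xC2-\xDF][\x80-\xBF]               # 2 bytes, no overlongs
        | \xE0[\xA0-\xBF][\x80-\xBF]           # 3 bytes, no overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}    # 3 bytes, general case
        | \xED[\x80-\x9F][\x80-\xBF]           # 3 bytes, no surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}        # 4 bytes, no overlongs
        | [\xF1-\xF3][\x80-\xBF]{3}            # 4 bytes, general case
        | \xF4[\x80-\x8F][\x80-\xBF]{2}        # 4 bytes, up to U+10FFFF
    /x';

    preg_match_all($wellFormed, $s, $matches);

    return implode('', $matches[0] ?? []);
}

var_dump(strip_invalid_utf8("\xF5\x80\x80\x80PHP")); // string(3) "PHP"
```

Because \xF5 is not an allowed lead byte and \x80 can never start a sequence, all four leading bytes of the test string are skipped and only "PHP" survives.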
 [2021-08-16 12:44 UTC] nikic@php.net
Presumably iconv (or at least this iconv implementation) doesn't consider encoded codepoints > U+10FFFF to be invalid.
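
To illustrate the point above: if a decoder applies the usual 4-byte UTF-8 bit layout without a range check, "\xF5\x80\x80\x80" yields a code point above U+10FFFF. This is a sketch of that arithmetic only; decode_four_byte_sequence() is a hypothetical helper, and whether any particular iconv implementation actually decodes the bytes this way is an assumption, not something confirmed in this report.

```php
<?php
// Hypothetical helper: decode four bytes using the 4-byte UTF-8 layout
// (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx) without validating the result.
function decode_four_byte_sequence(string $bytes): int
{
    // unpack() returns a 1-based array; array_values() makes it 0-based.
    [$b0, $b1, $b2, $b3] = array_values(unpack('C4', $bytes));

    return (($b0 & 0x07) << 18)
         | (($b1 & 0x3F) << 12)
         | (($b2 & 0x3F) << 6)
         |  ($b3 & 0x3F);
}

$cp = decode_four_byte_sequence("\xF5\x80\x80\x80");

// Prints: U+140000, beyond U+10FFFF: yes
printf("U+%06X, beyond U+10FFFF: %s\n", $cp, $cp > 0x10FFFF ? 'yes' : 'no');
```

The lead byte 0xF5 contributes the bits 101, so the result is 5 << 18 = 0x140000, which is outside the Unicode range that ends at U+10FFFF.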
 